Accelerating Distributed SGD With Group Hybrid Parallelism

The scale of model parameters and datasets is rapidly growing to achieve high accuracy in various areas. Training a large-scale deep neural network (DNN) model requires an enormous amount of computation and memory; therefore, parallelization techniques for training large-scale DNN models have attracted attention. A number of approaches have been proposed to parallelize large-scale DNN models, but these schemes lack scalability because of their long communication time and limited worker memory, and they often sacrifice accuracy to reduce communication time. In this work, we propose an efficient parallelism strategy named group hybrid parallelism (GHP) that minimizes training time without any accuracy loss. Two key ideas inspired our approach. First, grouping workers and training them by groups removes unnecessary communication among workers, saving a large amount of network resources when training large-scale networks. Second, mixing data and model parallelism reduces communication time and mitigates the worker memory limitation. Data and model parallelism are complementary to each other, so combining them reduces training time. We analyze the training time models of data and model parallelism, and based on these models, we present heuristics that determine the parallelization strategy minimizing training time. We evaluated group hybrid parallelism against existing parallelism schemes, and our experimental results show that it outperforms them.


I. INTRODUCTION
Recently, deep-learning techniques have received considerable attention in various areas such as medical imaging, space imaging, and VR/AR imaging. These areas often require high accuracy, so the scale and complexity of deep-learning models and the amount of training data are becoming immense [1]-[3]. Accordingly, distributed stochastic gradient descent (SGD) should be considered for training large-scale DNN models. However, scaling distributed SGD is challenging [4]. Large-scale DNN models have a huge number of model parameters and a huge amount of activation data, which leads to long communication delays and insufficient device memory to store them [5]. Thus, the key challenges in scaling distributed SGD for large-scale DNNs are how to reduce the long communication time and how to resolve the device memory limitation.
Two well-known parallelism methods for distributed SGD are data parallelism and model parallelism, but both have limitations for training large-scale DNN models in terms of scalability and accuracy [4], [6]-[8]. Data parallelism replicates the model parameters across workers and distributes data batches equally among them [3]. It is frequently used for distributed training due to its high worker utilization and simplicity. However, data parallelism shows low scalability when training large-scale DNN models [3], [9]. Training data and model parameters cannot fit in fixed worker memory when training a large-scale DNN model [10], [11]. Also, model parameters must be synchronized among workers at the end of each training iteration, which is a major bottleneck of distributed training [12], [13]. When we apply an allreduce operation to N compute nodes, the synchronization time can be represented as O(log N) or O(N) depending on the network topology, while the computation time scales as O(1/N) [14]. In practice, synchronization takes more than 90% of the total training time once the number of workers grows beyond 32, which leads to poor scalability [15]. Moreover, existing schemes for shortening the synchronization delay often sacrifice accuracy. Model parallelism partitions the entire training model into submodels [3]. It does not require parameter synchronization and resolves the worker memory limitation, but it also provides low scalability because of low worker utilization and the communication time of exchanging activation data. Because activation data grows with the batch size, this communication time must be reduced when training a large-scale DNN.
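As a toy illustration of this scaling argument, consider the following sketch (our own assumptions, not the paper's model) in which per-iteration compute shrinks as O(1/N) while a tree-based allreduce grows as O(log N); the synchronization share of each iteration then climbs with the worker count:

```python
import math

def iteration_profile(n_workers, comp1, m_bytes, alpha, beta):
    """comp1: single-worker compute time; m_bytes: parameter size per sync;
    alpha: per-message latency; beta: per-byte transfer time (assumed values)."""
    compute = comp1 / n_workers                                        # O(1/N)
    sync = math.ceil(math.log2(n_workers)) * (alpha + m_bytes * beta)  # O(log N)
    return compute, sync

# With illustrative constants, the synchronization share rises with N:
for n in (2, 8, 32, 128):
    c, s = iteration_profile(n, comp1=1.0, m_bytes=1e8, alpha=1e-4, beta=1e-9)
    print(n, round(s / (c + s), 2))
```

With these (hypothetical) constants, the synchronization fraction already exceeds 90% at 32 workers, mirroring the observation from [15].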
Additionally, another challenging issue in model parallelism is how to partition a large model and allocate workers so as to minimize training time [3]. Large-scale DNN models must be partitioned into multiple submodels, but this is not a trivial task because the number of partitioning cases grows with the number of layers. Currently, model partitioning and device placement are done by human experts or through reinforcement learning [16], [17]. Therefore, automated model-partitioning and worker-allocation schemes are needed to minimize training time [8].
In this paper, we propose a fast and scalable parallelism method for distributed SGD called group hybrid parallelism (GHP) for training large-scale DNN models. The key idea is that dividing workers into groups reduces the activation size, which mitigates both the communication time and the device memory limitation that arise during the training of a large-scale DNN model. In addition, we exploit the fact that the existing parallelism methods lack scalability for different reasons and are complementary to each other: to reduce training time, we utilize both data and model parallelism and balance them. Moreover, we propose a scheduling scheme that automatically partitions a large-scale training model into submodels and allocates workers to each submodel so as to minimize training time in a distributed environment. To do so, we propose a mathematical model that estimates the training time of distributed training. Our model does not predict the training time exactly, since it does not consider every parameter affecting it, but it reflects the trend of the training time for a given configuration and shows acceptable error (within a 15-20% error range), so it can be used for scheduling.
We expect that our work can be applied to cluster management for distributed training because our work minimizes training time with fixed resources.
We make the following contributions in this paper:
• We present an accurate mathematical model of the training time when data, model, or hybrid parallelism is applied. We analyze the behavior of distributed DNN training and decompose the training time into computation, communication, and synchronization time. We show experimentally that our proposed model accurately estimates the training time (about 15% MAPE).
• We propose and evaluate a parallelism scheme for fast and scalable training of large-scale DNNs. Our scheme optimally balances data and model parallelism to minimize training time and groups workers for scalability. We compare our work with other solutions and find that it outperforms them in terms of scalability and throughput.

II. RELATED WORKS AND PROBLEM DESCRIPTION
In this section, we overview parallelism methods for training distributed SGD, and discuss their problems for large-scale DNN models in terms of scalability and accuracy. We also review model partitioning schemes for model parallelism.

A. TRAINING LARGE-SCALE DNN MODELS
The trend of DNN models has two directions. In some studies, models are becoming smaller for memory savings and power efficiency with acceptable performance. In other studies, models are becoming deeper and wider to achieve high accuracy [18]; in these cases, the scale of datasets is also becoming immense. In this work, we are concerned with the second scenario with large batch sizes. When the batch size increases, hardware utilization increases and the number of training iterations decreases, so training is accelerated [19]. However, a large batch size reduces accuracy, so this effect must be mitigated [3].
To train large-scale DNN models, distributed training with multiple GPUs has gained attention [4]. Many DL frameworks support distributed training to handle large-scale DNN models. However, distributed training is challenging because it involves many issues such as parallelization, model consistency, communication, optimization, and hyperparameter tuning. Moreover, computing resources can be shared among multiple users to reduce costs; in the multi-tenant case, we should also consider communication, worker scheduling, worker placement, and fairness to improve job completion time and GPU cluster utilization. Krizhevsky [20] suggested the linear scaling heuristic of increasing the learning rate linearly with the batch size. In other words, if the learning rate is set to η for batch size B, then the learning rate becomes kη when the batch size increases to kB, while other parameters, such as weight decay, are maintained. Goyal [21] proposed a warmup method that enhances the linear scaling heuristic by linearly increasing the learning rate from zero to the target value kη over the first 5 epochs. With these heuristics, it is reported that 8K and 5K batches were successfully trained for ResNet-50 and ResNet-101, respectively, without any accuracy loss [21].
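The two heuristics above can be sketched as a simple schedule function. This is an illustrative rendering under our own assumptions (the function name, parameter values, and epoch-granularity ramp are ours; [21] warms up per iteration), not code from either work:

```python
def scaled_lr(base_lr, base_batch, batch, epoch, warmup_epochs=5):
    """Linear scaling rule plus gradual warmup (epoch granularity)."""
    target = base_lr * (batch / base_batch)   # linear scaling: k * eta
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly from ~0 up to the scaled target.
        return target * (epoch + 1) / warmup_epochs
    return target

# base_lr = 0.1 tuned for batch 256, scaled to batch 8192 (k = 32):
print(scaled_lr(0.1, 256, 8192, epoch=0))    # first warmup epoch: 0.64
print(scaled_lr(0.1, 256, 8192, epoch=10))   # after warmup: 3.2
```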
On the other hand, Google recently presented the concept of federated learning, which has become an emerging paradigm for distributed training; it enables global model training across decentralized devices, such as mobile devices and sensors, that hold local data [22], [23].
McMahan [23] presented the FedAvg concept of aggregating model parameters from local devices. Federated learning is communication-efficient and secure because it does not send the local data to the central server; instead, it sends model parameters at each epoch. Also, it can exploit each device's on-device compute capability for training [24].
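A minimal FedAvg-style aggregation step can be sketched as follows. This is an illustrative assumption on our part (server-side averaging weighted by local dataset size), not the exact algorithm of [23]:

```python
def fed_avg(local_params, local_sizes):
    """Average parameter vectors from devices, weighted by local data size.

    local_params: list of parameter vectors (plain lists of floats).
    local_sizes:  number of local samples on each device.
    """
    total = sum(local_sizes)
    dim = len(local_params[0])
    return [
        sum(p[i] * n for p, n in zip(local_params, local_sizes)) / total
        for i in range(dim)
    ]

# Two devices, the second holding twice as much data:
print(fed_avg([[1.0, 2.0], [4.0, 8.0]], [1, 2]))  # -> [3.0, 6.0]
```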

B. PARALLELISM METHODS FOR TRAINING LARGE-SCALE DNNs
There have been various attempts to train DNNs in distributed computing environments due to the increase of data and computation requirements. The Google Brain Project proposed data parallelism and model parallelism techniques for distributed learning [25], [26]. Each parallelism method has trade-offs, so we should determine which parallelism scheme is better considering hardware constraints. We will discuss the problem of data and model parallelism when applied to train a large-scale DNN model and summarize existing works about that issue.
Data parallelism is mainly utilized as a parallelism method due to its simplicity and high worker utilization. However, data parallelism has limitations for application to large-scale DNN models due to the long synchronization time of model parameters and limitation of minibatch size [8], [10]. There are two methods to solve this problem, namely, relaxation of synchronization conditions and compression of model parameters.
Relaxing the synchronization condition reduces synchronization overhead by training with stale parameters rather than strictly synchronized parameters. The existing data parallelism method mainly utilizes the allreduce-based synchronous SGD (S-SGD) method, in which all workers train with the same parameters [3]. Dean [26] proposed a parameter server-based asynchronous SGD (A-SGD) method that reduces synchronization time by allowing workers to train with different parameters. A-SGD greatly reduces communication delay and enables fast convergence, but a staleness problem occurs as a result of training with stale parameters [29]. To solve this, stale-synchronous techniques have been proposed that penalize workers training with stale parameters to limit the staleness effect [27], [28].
Compressing model parameters prunes model parameters to reduce the synchronization delay. Alistarh [31] proposed a gradient quantization method for expressing parameters in only a few bits. Lin [32] proposed a gradient compression scheme based on sparsification to reduce the parameter size. Parameter compression greatly reduces the gradient size and improves communication time, but it can sacrifice accuracy as well [31]. Several works [32], [33] reduced the model parameter size without sacrificing accuracy in some environments, but whether they generalize has not yet been proven. Moreover, reducing the parameter size does not fundamentally solve the problems caused by model complexity and huge computation. Parameter compression schemes are orthogonal to the scheduling scheme we are concerned with, so they can be applied simultaneously.
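As a concrete illustration of sparsification-based compression, the sketch below keeps only the top-k gradient entries by magnitude and accumulates the dropped remainder locally so that it is not lost. The function name and the residual-accumulation detail are our assumptions, not the exact schemes of [31] or [32]:

```python
def topk_sparsify(grad, residual, k):
    """Send only the k largest-magnitude entries; carry the rest forward."""
    full = [g + r for g, r in zip(grad, residual)]
    # Indices of the k largest-magnitude entries.
    idx = sorted(range(len(full)), key=lambda i: -abs(full[i]))[:k]
    keep = set(idx)
    sent = [v if i in keep else 0.0 for i, v in enumerate(full)]
    new_residual = [v if i not in keep else 0.0 for i, v in enumerate(full)]
    return sent, new_residual

sent, res = topk_sparsify([0.5, -3.0, 0.1, 2.0], [0.0, 0.0, 0.0, 0.0], k=2)
print(sent)  # -> [0.0, -3.0, 0.0, 2.0]
print(res)   # -> [0.5, 0.0, 0.1, 0.0]
```

Only the nonzero entries (value plus index) would actually be transmitted, which is where the communication savings come from.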
Model parallelism is not frequently used because of its low worker utilization, long communication time, and the need for a model-partitioning methodology. However, it has recently attracted attention as a way to address the problems of data parallelism as the scale of training models has increased [3], [4]. In particular, pipeline parallelism, an advanced form of model parallelism that mitigates low utilization and long communication time through pipelining, has been proposed. Pipeline parallelism divides the entire training model into ''stages,'' one per worker, and assigns a worker to each stage. By pipelining data, the communication time is reduced and worker utilization is increased. However, pipeline parallelism also lacks scalability: because the number of partitions (stages) equals the number of workers, its scalability is limited as the number of workers increases. Moreover, the communication time cannot be hidden by overlapping as the batch size increases. Also, some pipeline parallelism schemes sacrifice accuracy [15].
Harlap [15] presented PipeDream, an asynchronous pipeline parallelism framework. PipeDream partitions layers into multiple stages, where each stage contains sequential layers, and maps each stage to a separate worker for training. PipeDream proposes a one-forward-one-backward (1F1B) scheme, which alternates the feed-forward pass with the backpropagation of the previous run and stores multiple versions of gradients. This greatly increases worker utilization because workers do not have to wait for backpropagation, and it reduces activation communication time because communication can be overlapped with computation. However, PipeDream is not scalable for two reasons. First, the communication time is long when training large-scale networks. PipeDream addresses this by overlapping communication with computation, so the training time is the maximum of the total computation time and the total communication time; for a large-scale DNN, however, the communication time is long compared to the computation time and therefore cannot be hidden by overlapping. Second, because PipeDream stores multiple versions of gradients, it wastes worker memory and cannot handle large networks and batch sizes. Also, PipeDream reduces accuracy because it suffers from staleness: it trains on stale parameters, which degrades accuracy.
Huang [34] provided a synchronous pipeline parallelism library named GPipe. GPipe focuses on maintaining high accuracy while partially solving the low-utilization issue of model parallelism. GPipe partitions a model into multiple stages in a way that divides the load of the entire model evenly, then assigns different partitions to different workers. To increase worker utilization, GPipe divides minibatches into microbatches and trains the microbatches in sequential order, enabling seamless training without any staleness problem. At the end of the iteration, GPipe synchronizes the gradients of the microbatches. However, GPipe still has idle time in the training process, which lowers throughput; in fact, the throughput of GPipe is only 29% of PipeDream's [15]. For a large-scale network, GPipe cannot partition the entire model into stages because the solution space is too large. Moreover, GPipe has a load-balancing issue among partitions; therefore, GPipe is not scalable. Unlike PipeDream, however, GPipe increases worker utilization without reducing accuracy.
Hybrid parallelism utilizes both data and model parallelism because they are complementary to each other. Data parallelism alleviates the low worker efficiency and activation communication time that characterize model parallelism, and model parallelism resolves the long synchronization time of data parallelism. Therefore, when the two parallelization methods are optimally mixed, training time and scalability improve over pure data or model parallelism.
Gholami [9] proposed an effective parallelism scheme that utilizes model, data, and domain parallelism. The key idea is to minimize the communication time, which is the main bottleneck of the training procedure, by mixing model, data, and domain parallelism. The scheme considers the whole spectrum between pure data parallelism and pure model parallelism with a precise mathematical model. Gholami showed that balancing data and model parallelism can reduce the training time even for large-scale networks. However, there is no exact way of finding the optimum balance of data and model parallelism.
Jia [35] proposed layerwise parallelism. It analyzes the tensor structure of each layer and finds the parallelization scheme most suitable for a specific layer. A cost model and a graph-partitioning algorithm are applied to find the optimal solution.

C. WORKER ALLOCATION PROBLEM IN DISTRIBUTED DEEP LEARNING
The worker allocation problem (or device allocation problem) in distributed deep learning concerns how workers are allocated to a training model through scheduling [4]. When data parallelism is applied, a training model can be trained by simply dividing the entire batch evenly among workers with the entire model parameters. However, as we discussed above, model or hybrid parallelism should be considered to train a large-scale deep learning model. To apply model or hybrid parallelism, a methodology for partitioning a model into multiple submodels must be established. In other words, the worker allocation problem in distributed deep learning is a matter of determining how to partition the model into submodels and determining which workers to place in which submodel. We summarize the existing solutions about worker allocation and model partitioning schemes in distributed deep learning.
A reinforcement learning-based (RL-based) model partitioning scheme automatically obtains a near-optimal solution for model partitioning and device allocation regardless of model characteristics. Mirhoseini [16] proposed a device placement method that determines the model partition and device allocation minimizing computation time, based on reinforcement learning and the policy gradient method, in heterogeneous environments. The RL-based scheme provides faster training than human experts' placements. However, these schemes need a pre-training procedure for scheduling, which is time and resource intensive; it is reported that pre-training takes about 12 to 27 hours for each training model. Also, they do not consider data parallelism or pipelining for acceleration, so hardware utilization is low. Moreover, the RL-based model partitioning method cannot be applied to models with heavy computation, such as RNNLM and Inception-v3. To overcome this issue, Mirhoseini [16] subsequently introduced a hierarchical method to train more complex models.
Heuristics-based model partitioning methods are greatly influenced by network characteristics and scheduling techniques. They are empirical, so they are easy to understand and implement, and they can accommodate various parallelism methods. Krizhevsky [20] and Wu [37] proposed a layer separation scheme named one weird trick (OWT), based on the insight that the convolution layers have the most computation while the fully-connected layers have the most parameters. Their method decouples the convolution and fully-connected layers and places workers on each submodel in a data-parallel manner. This method is intuitive and easy to implement; however, it cannot be applied generally to arbitrary models and datasets because the dependencies of the model, network, and system hardware must be considered. Moreover, these schemes are not scalable when the dataset becomes large, so they are not suitable for training large-scale models.

D. CHALLENGES IN HYBRID PARALLELISM FOR TRAINING LARGE-SCALE DNN
We present three challenging issues involved in training a large-scale DNN model by applying hybrid parallelism and summarize our approach to tackle these issues.

1) TRAINING ACCELERATION THROUGH WORKER ALLOCATION
It is important to balance data and model parallelism optimally because the training time varies greatly with the system configuration [9], [38]. Currently, there is no specific methodology for balancing them due to the large solution space and the need for a model partitioning scheme; therefore, most existing systems use only data parallelism due to its simplicity. Designing a scheduling algorithm that finds a model partition and worker allocation minimizing the training time of hybrid parallelism is challenging.

2) GUARANTEEING SCALABILITY WITHOUT ACCURACY LOSS
As mentioned in Section 2.A, large-scale DNN models are used to achieve high accuracy, but they are limited by long training times and worker memory constraints. The methodologies for overcoming this, described in Section 2.B, either have limitations for training large-scale models or datasets, or even incur accuracy loss. An accuracy drop defeats the purpose of using a large-scale model to increase accuracy in the first place. Therefore, there is a need for a training methodology that can train large-scale DNN models without any loss of accuracy.

3) FULLY AUTOMATED TRAINING FRAMEWORK
Most of the existing frameworks support data parallelism, while model or hybrid parallelism is only partially supported. In addition, to use these frameworks, the system configuration, such as the model partition and worker allocation, must be set manually, and data propagation or synchronization between specific workers must also be implemented manually. Moreover, the existing worker scheduling schemes often require additional information, such as the number of partitions, because of the large solution space. Therefore, a framework is needed that enables automated training given only basic information such as the training model, batch size, and network configuration.
In this work, we propose a novel parallelism method to solve the above three issues. Our work targets scalability and fast training, so it works faster than other schemes even when the number of workers, the batch size, and the model size are large. The main idea of our scheme is to aggregate workers into R groups for scalability. Through grouping, the activation data size is reduced by a factor of R, so our scheme works fast even when the batch size is large. Also, each group trains the network utilizing both data and model parallelism for faster training.
As described earlier, existing worker allocation schemes do not provide a systematic analysis for determining the parallelization strategy that minimizes training time; instead, they focus on reducing the communication or computation time itself, and often sacrifice accuracy. In contrast, our proposed scheme balances computation, communication, and synchronization time to minimize training time through worker allocation. To optimally balance data and model parallelism, we provide a precise mathematical model for estimating the training time when data, model, or hybrid parallelism is applied. We also provide heuristics that solve the optimization problem of minimizing the training time given by the model in a near-optimal manner. Existing schemes failed to solve this problem due to its large solution space, but our approach determines a solution by approximation.

III. GROUP HYBRID PARALLELISM METHOD AND ITS TRAINING TIME MODEL DESCRIPTION
In this section, we propose the group hybrid parallelism method, describe its operation mechanism, and present the training time model when group hybrid parallelism is applied. Group hybrid parallelism is an improvement of existing hybrid parallelism: it exploits both data parallelism and model parallelism simultaneously and divides the training workers into multiple groups. We discuss in detail how the proposed parallelism approach accelerates training and provides scalability. Fig. 1 shows the overall architecture of the proposed group hybrid parallelism. As seen in Fig. 1, group hybrid parallelism places workers (yellow rectangles in Fig. 1) in a 2D grid that combines data parallelism (vertical direction in the figure; data parallelism is applied to workers associated with the same submodel) and model parallelism (horizontal direction in the figure; the entire training model is partitioned through the model partitioning process). In other words, the entire training model is divided into K sequential submodels, and all of the workers are divided into R groups. The workers of a group are mapped to the submodels; thus, {w_1, ..., w_K} workers are assigned to train the K submodels within a single group, and R·w_k workers in total are assigned to train submodel G_k. Training is performed in units of groups, and each group trains independently with equally divided batches. When a training iteration is over, the workers of all groups synchronize their model parameters to maintain accuracy.

A. SYSTEM OVERVIEW
The training procedure of group hybrid parallelism is similar to that of model parallelism. It is depicted on the right side of Fig. 1 with an example in which the number of submodels K = 2 and the worker allocation w = {1, 2}. When input data arrives at a submodel, it is distributed equally to all workers associated with the submodel, and the results obtained from all workers are collected to form the activation data. The collected activation data is then propagated to the next submodel, and the same process is repeated for all submodels. The process of distributing and collecting data within a group is implemented by MPI scatter and MPI gather, and the communication time incurred in this process is called the activation communication time (blue arrow and text in Fig. 1). When each group completes one training iteration, a local gradient update of the group is formed for each submodel. The local gradient updates from different groups are accumulated to form a global gradient update. Model parameters are synchronized among workers, so all workers assigned to the same submodel always have identical parameter values regardless of their group. The communication delay incurred in this procedure is called the synchronization time (red arrow and text in Fig. 1).
Group hybrid parallelism consists of a model partitioning step that divides the learning model into K submodels and a worker allocation step that divides workers into R groups and assigns workers belonging to the group to submodels. We describe the role of each step and why it is necessary for the training of large-scale DNN models. Each step will be further analyzed in Sections 4.2 and 4.3 to determine the model partition and worker allocation that minimize the training time.
Step 1: Model Partitioning Step: Model partitioning is the process of partitioning the entire training model of M layers layerwise into K sequential submodels. As seen in Fig. 2, partitioning the model parameters reduces synchronization time because synchronization can be overlapped across different submodels; thus, the model partitioning process must be considered for training a large-scale DNN model. Also, memory constraints are relaxed because the number of model parameters per worker is reduced, allowing workers to train with a larger batch size.
Step 2: Worker Allocation Step: Worker allocation maps workers to submodels and is applied after the model partitioning process is completed. The worker mapping procedure divides the NP workers into R equal groups and assigns the workers of each group to the K submodels to determine the worker allocation w = {w_1, ..., w_K}; thus R(w_1 + ... + w_K) = NP holds. By dividing workers into groups, the activation communication time can be reduced drastically. For example, consider the process of training the first submodel in Fig. 1. When workers are divided into groups, w_1 workers simultaneously train the same network and form activation data of size D_1/R. If the workers were not divided into groups, R·w_1 workers would simultaneously train the same network and form activation data of size D_1. Therefore, worker grouping is essential for training large-scale deep learning models.
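The grouping arithmetic above can be sketched as a small consistency check. The symbols follow the text (NP total workers, R groups, allocation w, activation sizes D_k), while the concrete numbers are illustrative assumptions:

```python
def check_allocation(w, R, total_workers):
    """R identical groups share the allocation, so R * sum(w) must equal NP."""
    return R * sum(w) == total_workers

def activation_per_group(D, R):
    """Each group trains 1/R of the batch, so activation sizes shrink by R."""
    return [d / R for d in D]

# Hypothetical cluster: NP = 16 workers, R = 4 groups, K = 3 submodels.
w, R = [1, 2, 1], 4
print(check_allocation(w, R, total_workers=16))   # -> True
print(activation_per_group([400.0, 800.0], R))    # -> [100.0, 200.0]
```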

B. TRAINING TIME MODEL
In order to estimate the training time, we analyze the training procedure when group hybrid parallelism is applied. In this subsection, we design a mathematical model of the training time of group hybrid parallelism when the partitioned submodels G = {G_1, ..., G_K}, the number of workers allocated to each submodel w = {w_1, ..., w_K}, the batch size B, and the number of groups R are given. To model the training time, we analyze the training behavior of data and model parallelism. The notation used in this model is summarized in Table 1.

1) COMPUTATION TIME
The computation time is the time it takes for workers to perform the computation operations (feed-forward and backpropagation) for training. It is calculated by dividing the computation workload of a submodel by the aggregate throughput (FLOPS) of the workers associated with that submodel. We assume that when the batch size is sufficient (≥ 32), the computation workload assigned to a worker is evenly distributed across cores and threads, so that the worker's performance is close to its peak FLOPS. Let C^F_k and C^B_k be the computation FLOPs of the feed-forward and backpropagation processes of G_k. Then, we can calculate the computation time of G_k by (1).
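Assuming (1) takes the form (C^F_k + C^B_k)/(w_k · FLOPS), i.e., submodel workload divided by the aggregate throughput of its w_k workers (our rendering of the description above), the estimate can be sketched as:

```python
def compute_time(c_f, c_b, w_k, worker_flops):
    """(C^F_k + C^B_k) / (w_k * FLOPS): workload over aggregate throughput."""
    return (c_f + c_b) / (w_k * worker_flops)

# Hypothetical submodel: 6 TFLOPs forward, 12 TFLOPs backward, 9 TFLOPS GPUs.
t2 = compute_time(6e12, 12e12, w_k=2, worker_flops=9e12)
t4 = compute_time(6e12, 12e12, w_k=4, worker_flops=9e12)
print(t2, t4)   # doubling the workers halves the computation time
```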
2) COMMUNICATION TIME
The communication time is the delay generated by exchanging activation or gradient data between adjacent submodels. It includes scattering input data to the workers associated with a submodel and exchanging the activation or gradient data between adjacent submodels. The communication time is determined by the network bandwidth and the amount of data.
Let input data of size D_{k-1}/R arrive at the x_k = w_k/P compute nodes associated with submodel G_k, generating activation data of size D_k/R. When w_k > 1, a delay occurs while distributing the input data to workers and gathering the output activation for the next submodel. We separate the communication time into inter-node and intra-node communication time. Inter-node communication time applies when node-to-node communication occurs; it is modeled by Hockney's model [40], and we assume all nodes in the same group are connected to each other. For inter-node communication, MPI scatter is utilized among nodes to scatter data to workers, and it is modeled by the MST or BKT algorithm [14]. Each worker's execution result is transferred directly to the next submodel. Direct communication via the PCI-e link is applied for intra-node communication. Putting it all together, we can formulate the communication time of G_k as (2).
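A hedged sketch of these cost models: Hockney's model charges a startup latency α plus β per byte, and an MST-style scatter over p nodes takes roughly log2(p) latency rounds while moving (p−1)/p of the data. The exact constants in (2) may differ, and the parameter values below are illustrative assumptions:

```python
import math

def hockney(m_bytes, alpha, beta):
    """Point-to-point transfer of m bytes: startup latency + per-byte cost."""
    return alpha + beta * m_bytes

def mst_scatter(m_bytes, p, alpha, beta):
    """Binomial-tree (MST) scatter of m bytes over p nodes: ~log2(p) rounds."""
    return math.ceil(math.log2(p)) * alpha + beta * m_bytes * (p - 1) / p

# 1 MB over a link with 10 us latency and 1 GB/s bandwidth:
print(hockney(1e6, alpha=1e-5, beta=1e-9))         # ~1.01 ms
print(mst_scatter(1e6, p=4, alpha=1e-5, beta=1e-9))
```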

3) SYNCHRONIZATION TIME
The synchronization time is the delay caused by collecting weight gradients from workers and updating new parameters. The synchronization time is also comprised of inter-node synchronization time and intra-node synchronization time. Intra-node synchronization time is modeled by direct communication using the PCI-e link. Inter-node synchronization process is modeled by MPI Allreduce, which is already VOLUME 9, 2021 modeled considering various network topologies and algorithms [41]. Synchronization time based on MPI Allreduce is a function of the number of nodes to synchronize, denote by p k (w k ), should be clarified. It can be obtained by multiplying the number of groups R by the number of nodes which are allocated to the submodel in a single group. Equation (3) formulates the number of nodes to synchronize the submodel G k .
where ∧ and ∨ denote the min and max operations, respectively. The synchronization time can be described as (4) when the Ring-AllReduce algorithm [41] is applied.
Note that the synchronization time is an increasing step function of w_k with interval P, so we can treat it as constant in w_k over each partitioned domain. The synchronization time model can also be adapted to the interconnection network or algorithm in use [41].
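The quantities above can be sketched as follows; the step behavior of p_k(w_k) in w_k is what justifies treating the synchronization time as piecewise constant. The function names and the exact Ring-AllReduce constants are our simplifying assumptions:

```python
import math

def nodes_to_sync(w_k, P, R):
    # p_k(w_k): the number of groups R times the nodes hosting submodel k
    # in one group; a step function of w_k with interval P
    return R * math.ceil(w_k / P)

def ring_allreduce_time(alpha, beta, nbytes, p):
    # Ring-AllReduce over p nodes: 2(p-1) steps, each moving nbytes/p
    if p <= 1:
        return 0.0
    return 2 * (p - 1) * (alpha + beta * nbytes / p)
```

With P = 4 and R = 2, for instance, p_k is 2 for every w_k in 1..4 and jumps to 4 at w_k = 5, which is the step-function behavior used in Substep 2.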
Using the mathematical models of the computation, communication, and synchronization time, we can model the training time when data, model, or group hybrid parallelism is applied. As described in Fig. 2, the training time of data parallelism is simply the sum of the computation time and the synchronization time. The training time of model parallelism is the sum of the computation time and the communication time of each submodel. When group hybrid parallelism is applied, the synchronization time is hidden in the overall training process, so it is counted toward the overall training time only once across all submodels. The overall training time is determined by the submodel with the latest parameter update (Submodel #1 in Fig. 2). Since submodel i starts synchronizing when it finishes its backpropagation, it completes its parameter synchronization at the time given in (5), where δ_k β R is constant over w_k. Finally, the training time under hybrid parallelism can be expressed as (6), the maximum of these completion times, reached when all submodels have finished their parameter updates. Here k_p denotes the index of the submodel that finishes training last. The synchronization time of submodel G_{k_p} is the one accounted for in the training time, and we call G_{k_p} the sync-submodel.
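A simplified sketch of how the three time components combine under each strategy (our own simplification of (5)-(6); it omits the paper's additional constants and assumes the backward sweep visits the submodels in reverse order):

```python
def data_parallel_time(comp, sync_total):
    # data parallelism: full compute plus full-parameter synchronization
    return sum(comp) + sync_total

def model_parallel_time(comp, comm):
    # model parallelism: per-submodel compute plus the K-1 boundary exchanges
    return sum(comp) + sum(comm)

def ghp_iteration_time(comp, comm, sync):
    # GHP: each submodel starts synchronizing as soon as its backward
    # pass ends; the iteration ends when the latest sync (that of the
    # sync-submodel) completes.
    # comp: K per-submodel compute times, comm: K-1 boundary exchange
    # times, sync: K per-submodel synchronization times.
    K = len(comp)
    total = sum(comp) + sum(comm)          # end of the backward sweep
    # submodel k's backward pass ends before submodels k-1..0 run theirs
    bwd_end = [total - sum(comp[:k]) - sum(comm[:k]) for k in range(K)]
    return max(bwd_end[k] + sync[k] for k in range(K))
```

In this toy model the index attaining the max plays the role of k_p, the sync-submodel.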
C. CONVERGENCE ANALYSIS OF GROUP HYBRID PARALLELISM
As depicted in Fig. 1, the global gradient update is computed from the local gradient updates acquired by each group. Let the parameter of submodel G_k be φ_k, and the local gradient of the r-th group after training G_k be ∇φ_{k,r}. Then the model parameter at the t-th iteration can be formulated by (7), where

∇φ_k = (∇φ_{k,1} + · · · + ∇φ_{k,R}) / R.

With the update rule in (7), the convergence rate of group hybrid parallelism is the same as that of conventional minibatch SGD with full batch size B, which has a convergence rate of O(1/√(Bt) + 1/t) at iteration t [39]. This is because the gradients obtained by group hybrid parallelism have the same value as the gradients acquired by minibatch SGD with full batch B, as proved in Proposition 1.
Proposition 1 (Convergence Analysis of GHP): The update rule formulated in (7) is equivalent to that of minibatch SGD with the full batch.
Proof: Let f(x_i, y_i) be the loss function for the i-th data sample, and let η_t be the learning rate at iteration t. Then the gradient update of minibatch SGD with batch size B can be formulated as follows. When group hybrid parallelism is applied, the data batch B is distributed equally among the R groups.
Through Proposition 1, we conclude that group hybrid parallelism does not sacrifice accuracy because it has the same update rule as synchronous SGD.
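The equivalence in Proposition 1 can be checked numerically on a toy loss. The sketch below uses a scalar squared-error loss; when the batch is split into R equal shards, averaging the R group gradients reproduces the full-batch gradient exactly:

```python
def grad_full(w, xs, ys):
    # full-batch gradient of the mean loss (1/2)*(w*x - y)^2
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def grad_ghp(w, xs, ys, R):
    # update rule (7): each of R groups averages its own B/R shard,
    # then the R group gradients are averaged again
    shard = len(xs) // R
    local = [grad_full(w, xs[r*shard:(r+1)*shard], ys[r*shard:(r+1)*shard])
             for r in range(R)]
    return sum(local) / R
```

Because the shards are equal in size, the average of averages equals the overall average, which is exactly the argument behind Proposition 1.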

IV. DETERMINING SYSTEM CONFIGURATION FOR GROUP HYBRID PARALLELISM
In Section III, we explained the concept of group hybrid parallelism and provided a mathematical model of the training procedure. To benefit from group hybrid parallelism, model partitioning and worker allocation techniques are needed to determine the system configuration that minimizes the training time over the solution space. This section deals with the worker allocation problem of minimizing the training time when group hybrid parallelism is applied.

A. WORKER ALLOCATION MODEL FOR GROUP HYBRID PARALLELISM
We focus on training a large-scale DNN model with a large dataset on a distributed computing infrastructure. We train a DNN model with M layers {L_1, · · · , L_M} on N homogeneous computing nodes, each with P workers. Each worker has a computing capability of f operations per second and a memory of size m; in other words, an idle computing node has a computing capability of P·f and a memory of size P·m. To increase the utilization of computing resources, a batch of size b or more should be assigned to each worker. We ignore any faults and interference that may occur during training.
The objective is to minimize the training time formulated in (6). As discussed in Section III, the training time is determined by the model partition G = {G_1, G_2, · · · , G_K}, the number of groups R, and the per-group worker allocation w = {w_1, w_2, · · · , w_K}; thus, the training time can be minimized by adjusting them. As described in Section III, we have three constraints: (1) the number of workers is restricted to NP/R (worker constraint), (2) the activation data and model parameters assigned to a worker, M_k/w_k, must not exceed the memory size m of a single worker (memory constraint), and (3) the batch size for a single worker, B/(R w_k), must be at least b (batch constraint). Equation (10) and constraints (11)-(13) give the programming form of the training-time minimization problem under group hybrid parallelism. This problem is a form of the open-shop scheduling problem [42], which is proven to be NP-hard. Because model partitioning and worker allocation must be considered simultaneously, the solution space is too large to search exhaustively; therefore, we propose heuristics to determine a system configuration that minimizes the training time.
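The three constraints can be encoded as a small feasibility check (a sketch using the paper's symbols; the function itself is our illustration, not part of the proposed scheme):

```python
def feasible(w, R, N, P, M, m, B, b):
    # w: per-group worker allocation [w_1..w_K]
    # M: per-submodel parameter + activation sizes [M_1..M_K]
    if sum(w) > N * P / R:              # (11) worker constraint
        return False
    for w_k, M_k in zip(w, M):
        if M_k / w_k > m:               # (12) per-worker memory <= m
            return False
        if B / (R * w_k) < b:           # (13) per-worker batch >= b
            return False
    return True
```

The heuristics in the following subsections search only within this feasible set.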
Our scheme has two steps: model partitioning and worker allocation. The model partitioning step divides the entire training model into K submodels G = {G_1, · · · , G_K} that minimize the training time. The worker allocation step determines the worker allocation w = {w_1, · · · , w_K} and the number of groups R when the submodel set is given; it comprises four substeps, and the result it obtains is proven to be near-optimal. We explain each procedure in detail below.

B. MODEL PARTITIONING
The model partitioning step divides the entire training model layer-wise to form a set of submodels. A submodel is a set of one or more consecutive layers, and every layer must belong to exactly one submodel. Model partitioning determines the parameter size of each submodel, so it strongly influences the synchronization time; it also determines the activation size between adjacent submodels, so it strongly influences the communication time.
Because the training behavior changes with K, determining the number of submodels K is an important issue in the model partitioning procedure.
When K = 1, the submodel is equivalent to the entire training model, so the training is performed in the same way as data parallelism. Therefore, all of the model parameters must be synchronized, but there is no communication time to pass the activation. This case is useful when the number of model parameters or the number of workers is small.
When K ≥ 2, data parallelism and model parallelism are considered together, so both parameter synchronization and activation communication occur. If K equals the number of workers NP, pure model parallelism or pipeline parallelism results, because only one worker is placed on each submodel. When a model is divided into submodels, only the submodel with the slowest synchronization (the sync-submodel) contributes its synchronization time to the training time, because each submodel performs parameter synchronization independently. However, communication occurs at each boundary between adjacent submodels, so worker efficiency decreases as K increases. Therefore, when K > 2, training tends to be inefficient [16].
As model partitioning plays an important role in determining the synchronization and communication times, the proposed model partitioning scheme considers all possible partitionings by increasing K from 1 to (NP ∧ M). For each candidate partition, we obtain the system configuration that minimizes the training time (described in Section IV-C). The proposed method then analyzes the trend of the minimum training time as the number of submodels changes and stops the search at the elbow point: when the minimized training time increases as the number of submodels increases, the algorithm stops and returns the submodel set and system configuration that minimize the training time. The time complexity of model partitioning can thus be expressed as O(M^K). Note that K usually stops at 2-3 in most cases.
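The elbow-point search described above can be sketched as follows, where `min_time_for_K` stands in for solving the worker allocation subproblem over all K-way partitions (a hypothetical callback, not the paper's code):

```python
def partition_with_elbow(min_time_for_K, K_max):
    # increase K from 1; stop as soon as the minimized training time
    # stops improving (the elbow), returning the best (K, time) so far
    best_K, best_T = 1, min_time_for_K(1)
    for K in range(2, K_max + 1):
        T = min_time_for_K(K)
        if T >= best_T:
            break                      # elbow reached
        best_K, best_T = K, T
    return best_K, best_T
```

Note that this greedy stop assumes the minimum training time is unimodal in K, which matches the paper's observation that K rarely exceeds 2-3.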
In summary, the model partitioning procedure returns the submodel set that minimizes the training time considering all partitioning cases until the elbow point is found while increasing the number of submodels K .

C. WORKER ALLOCATION
The worker allocation step returns the number of groups R and the per-group worker allocation w that minimize the training time when the submodel set G is given. We therefore deal with solving the primal problem formulated in (10)-(13) when the submodel set G is specified. In this procedure, we focus only on the numbers of workers w = {w_k} that minimize the objective function; we treat R as a constant and execute the worker allocation procedure repeatedly while changing the value of R. We also treat the synchronization time T^sync_{k_p} as a constant by fixing the domain range of w_{k_p}.
To find the solution through an optimization technique, the objective function in (10) must be convex. To achieve this, we eliminate non-promising solutions to reduce the solution space and approximate the objective function to make it convex. We then apply an optimization technique to minimize the objective function while satisfying all constraints.
The worker allocation step consists of four substeps. The first three reduce the solution space through pruning or approximation and finally make the objective function convex; the last substep then determines the optimal solution for the given submodel set G and number of groups R. We explain each substep in detail.

1) SUBSTEP 1: PRUNING NON-PROMISING SOLUTIONS FOR y_k < 0
When y_k < 0, the term y_k/w_k increases as w_k increases. Therefore, if one or more y_k values are negative, the objective function T(w) = Σ_{k=1}^{K} y_k/w_k + T^sync_{k_p} cannot be convex. In this substep, we eliminate the negative y_k values to make the objective function convex. When y_k < 0, the increase in training time due to communication outweighs any reduction in training time as the number of workers grows; thus the number of workers allocated to a submodel with y_k < 0 should be minimized to reduce communication. When y_k > 0, the training time decreases as w_k increases, so workers should be balanced only among the submodels with positive y_k. This is summarized in Proposition 2 and proved by contradiction (see Appendix).
With Proposition 2, we can remove the cases with y_k < 0, and the problem becomes the same problem under the additional constraint (14).

2) SUBSTEP 2: FINDING THE SOLUTION FOR THE SYNC-SUBMODEL
In this substep, we determine the number of workers for the sync-submodel, w_{k_p}, when y_{k_p} ≥ 0, to simplify the problem. As described in Section III-B, the synchronization time is an increasing step function of w_{k_p} with interval P, and we treat T^sync_{k_p} as a constant by dividing the domain of w_{k_p}. Let w*_{k_p} be the number of workers for the sync-submodel that minimizes the training time subject to (11)-(14). We can then restrict the range of workers for the sync-submodel, as described in Proposition 3 (see Appendix A for the detailed proof).
With Proposition 3, we can find the possible candidates for w*_{k_p} and choose the one that minimizes the training time.

3) SUBSTEP 3: APPROXIMATING THE PROBLEM TO DETERMINE THE NUMBER OF NODES
After Substeps 1 and 2, the minimization problem described in Section IV-A is equivalent to minimizing T(w) = Σ_{k=1}^{K} y_k/w_k subject to y_k ≥ 0, Σ_{k=1}^{K} w_k = W, and M_k/m ≤ w_k ≤ B/(bR), where W is a positive integer less than or equal to NP/R. We still cannot solve this optimization problem, since T(w) is not convex. T(w) is convex when f_1, · · · , f_K are convex; however, f_k is not continuous when w_k is an integer multiple of P. Therefore, we must restrict the domain of w_k to make f_k continuous and convex. In this substep, we determine the number of nodes x_k that makes f_k convex.
To determine the region of w_k, we first approximate y_k by a constant ȳ_k, defined in (15).
Then, we solve the approximated primal problem of minimizing T(w) = Σ_{k=1}^{K} f_k(w_k) subject to (11)-(14). Note that T : R^K → R and f_k : R → R are finite-valued convex functions, and (11)-(14) define closed convex sets. We can therefore solve the approximated problem using duality, as summarized in Lemma 1 (see Appendix for the detailed proof).
If the adjusted w_k values do not satisfy the boundary conditions, we adjust them again by the same procedure until all w_k values meet the boundary conditions. However, the solution {w̄_k} acquired by Lemma 1 is only an approximation to the optimal solution, since ȳ_k ≠ y_k. Theorem 1 analyzes the relationship between the solutions of the approximated and original problems. In conclusion, w̄_k is a (1+ε)-approximation to w*_k, and they have the same number of nodes x_k.
Theorem 1 (Optimality of the Approximated Solution): w̄_k is a (1+ε)-approximation to the optimal solution w*_k, and (x_k − 1)P ≤ w*_k ≤ x_k P holds. Proof: Let y_k = ȳ_k + Δ_k ≤ ȳ_k(1 + ε). By Lemma 1, w*_k is minimized when y_i = ȳ_i(1 + ε) for all i ≠ k, and maximized when y_i = ȳ_i(1 + ε) only for i = k. Thus (17) holds, which means that w̄_k is a (1+ε)-approximation to the optimal solution w*_k.
Also, Δ_k diminishes to 0 as r_k increases to P. Thus, w*_k becomes larger than w̄_k when r_k is small, and vice versa.
Theorem 1 implies that the approximated solution w̄_k and the optimal solution w*_k are similar and have the same number of nodes. Therefore, we can determine the range of the optimal solution w*_k, and the objective function becomes convex. Finally, we solve it in Substep 4.

4) SUBSTEP 4: FINDING THE SOLUTION
Let w̄_k and x_k be the numbers of workers and nodes that minimize the approximated problem in Substep 3. Through Substeps 1 to 3, the original problem in (10)-(13) becomes equivalent to minimizing

T(w) = Σ_{k=1}^{K} ŷ_k / w_k    (18)

subject to (x_k − 1)P ∨ (M_k/m) ≤ w_k ≤ x_k P ∧ B/(bR), Σ_{k=1}^{K} w_k ≤ W, and ŷ_k ≥ 0. Since the range of w_k is now fixed, the number of nodes ⌈w_k/P⌉ becomes constant, so we re-formulate y_k as ŷ_k. Based on Lemma 1, T(w) is then minimized by (20), where S is the sum of those w*_k that take one of the boundary values M_k/m, B/(bR), x_k P, or (x_k − 1)P.
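Ignoring the boundary clamping, the minimizer of Σ y_k/w_k subject to Σ w_k = W has a closed form consistent with Lemma 1: Lagrangian stationarity gives y_k/w_k² = λ for all k, hence w_k ∝ √y_k. A sketch (the function name is ours):

```python
import math

def allocate(y, W):
    # stationarity of the Lagrangian of sum(y_k/w_k) s.t. sum(w_k) = W
    # gives y_k / w_k^2 = lambda for all k, i.e. w_k proportional to sqrt(y_k)
    s = sum(math.sqrt(yk) for yk in y)
    return [W * math.sqrt(yk) / s for yk in y]
```

In the full procedure the result would then be clamped to the box constraints and the remaining budget redistributed, as described above.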
Through Substep 4, we obtain the number of workers that minimizes the objective function when w*_{k_p} is fixed. By applying (20), we obtain the optimal number of workers when R and G are given. We repeat Substeps 3-4 for every candidate found in Substep 2 and select the solution that minimizes the training time.
To sum up, the worker allocation step has four substeps: pruning the non-promising solutions with y_k < 0, determining the solution for the sync-submodel to simplify the problem, approximating the problem to determine the number of nodes, and finding the solution. The worker allocation step is repeated for every model division obtained from the model partitioning process to find the global solution. The overall process is summarized in Algorithm 1.
The time complexity of the worker allocation scheme is O(K log N). Since the time complexity of the model partitioning scheme is O(M^K), the overall time complexity of GHP is O(M^K K log N). Because K is small (2-3) in most cases, the proposed heuristics are not time-consuming, and the scheduling overhead can be ignored.

V. EVALUATION
In this section, we evaluate the performance of our proposed worker allocation scheme in distributed training and discuss the results.

A. EXPERIMENTAL SETUP AND METRICS
(Implementation) To verify the performance of the proposed group hybrid parallelism and worker allocation scheme, we implemented a scheduler and a distributed trainer based on Python and PyTorch. Fig. 3 depicts the architecture of the implemented scheduler and distributed trainer system. We briefly introduce how we implemented each of them.

Algorithm 1 Worker Allocation Scheme for Group Hybrid Parallelism
Input: G: set of submodels
Output: R: number of groups, w: worker allocation
for all k where y_k > 0 and k ≠ k_p do
The scheduler determines the model partition and worker mapping that minimize the training time when the node specification, network model, and batch size are given. It consists of a model partitioning function that splits the model into submodels, a worker grouping & mapping function that groups the workers and maps them to the submodels, and a model profiler function that analyzes the computation FLOPs of each layer. For convenience, the implemented model profiler does not predict computation FLOPs from measurements of real execution time; rather, it calculates the FLOPs of direct computation from the layer structure [38]. Because deep-learning frameworks implement their own optimized operations instead of direct computation, the estimated computation time may differ from the real execution time. However, the direct-computation estimate is meaningful in that it captures the trend of the computation FLOPs, even if it is not exact.
The distributed trainer applies the node configuration obtained by the scheduler and executes training on the cluster nodes. The cluster manager oversees the overall training process of the cluster nodes. The metadata manager handles the communication for collecting and distributing activation data among the nodes and workers in a group. The gradient synchronization component forms the global gradient update by collecting the local gradient updates generated by all workers and distributing the result back to all training workers.
To manage data communication among workers and nodes, our implementation uses two kinds of agents: group agents and node agents. Fig. 4 presents how the agents manage data communication among workers. A group agent exists for each group and manages the data communication that occurs within the group. As depicted in Fig. 4(b), multiple group agents can exist for a single group when the group contains more workers than a node holds, and they communicate with each other through node agents. A node agent exists for each node and manages communication among nodes, including parameter synchronization among groups and data communication within a group. These processes were implemented with PyTorch and MPI.
(Testbed Specification and Training Model) In this study, we ran experiments on eight Nvidia RTX 2080 workers. Each worker delivered 10.07 TFLOPS with 368 tensor cores and 8 GB of GDDR6 memory. Each node contained a Xeon Phi CPU, 64 GB of memory, and 4 workers connected through a PCI-e channel, and the nodes were connected via a 1 Gbps Ethernet switch. For the weak scaling test, we extended the node-to-node bandwidth to 10 Gbps and the worker count to 1,024 by emulation, based on training times measured at laboratory scale. CUDA version 10 was installed on our testbed. We ran experiments on various network models and parallelism methods, considering the VGGNet16, VGGNet19, RESNet50, RESNet101, and RESNet152 models with the ImageNet-1K dataset.
(Baselines) We compare GHP with three parallelism methods: synchronous data parallelism [2], the one weird trick (OWT) method [20], and GPipe [34]. (i) Synchronous data parallelism synchronizes all parameters every iteration and is adopted by many frameworks due to its simplicity. (ii) OWT is an empirical hybrid parallelism method that separates the conv and fc layers and applies data parallelism to the convolutional layers and model parallelism to the fully connected layers. (iii) GPipe is a pipeline parallelism method that divides the entire training model into stages and accelerates training by splitting each batch into microbatches. Note that none of the parallelism methods listed above sacrifices accuracy, so the accuracy remains the same.
To measure scalability, we used the speedup, defined as the throughput of n workers divided by the throughput of a single worker [43]:

speedup(n) = (throughput of n workers) / (throughput of 1 worker).    (22)

In our experiment, we did not measure the accuracy of the training model because the parallelism methods used (Sync-DP, OWT, GPipe, and the proposed scheme) do not sacrifice accuracy at all.

B. PERFORMANCE EVALUATION
We present three experiments to evaluate our proposed training time model and group hybrid parallelism.

1) ACCURACY OF THE TRAINING TIME MODEL
As shown in Fig. 5, the proposed training time model predicted the training time well. Over the 46 configurations, the mean absolute percentage error (MAPE) was about 35% for the computation time and about 15% for the communication, synchronization, and total training time. There are several sources of error between the estimated training time and the real execution time. The computation time showed a larger error than the other components (see Fig. 5(a)) because we estimated it from the amount of calculation incurred when the DNN operations are computed directly. In a real environment, however, DNN operations are computed through methods such as GEMM or FFT instead of direct calculation; we expect a more precise estimate of the computation time if it is based on profiling. Also, since each training framework is implemented differently, many other factors affect training, such as per-framework delays, and they would have to be considered for a precise estimate. However, the training time model in this study does not aim to predict the exact training time; its goal is to capture the trend of the training time over the given configurations (sets of submodels and worker assignments). We therefore considered the above overheads beyond the scope of this paper. Despite these limitations, the error between the estimated and actual training time was low enough that the estimates from the mathematical model could be used for scheduling.
In conclusion, our proposed mathematical model can serve as a useful tool for determining worker configurations across training setups, such as the training model, batch size, and worker allocation.

2) EFFECT OF GROUPING WORKERS
This experiment analyzed the effect of grouping workers on the training time. We fixed the number of workers belonging to a single group and found the minimum training time over all possible configurations with 8 workers. Fig. 6 shows the minimum training time as the number of workers in a single group varies from 1 to 8. Fig. 6 can be interpreted by dividing the domain into three regions, each corresponding to a different parallelism method. When the number of workers in a group is 1 (w_k = 1, grey region in Fig. 6), only one submodel exists for training; thus, synchronous data parallelism is applied in this region. When the number of workers in a group is larger than half the total (4 < w_k ≤ 8, blue region in Fig. 6), only one group exists and two or more submodels can exist for training; thus, model parallelism or hybrid parallelism is applied in this region. When the number of workers in a group is at most half the total (1 < w_k ≤ 4, green region in Fig. 6), two or more groups and two or more submodels can exist for training; thus, group hybrid parallelism is applied in this region.
As the characteristics of each training model differ, the shapes of the graphs also differ and are hard to predict. However, as seen in Fig. 6, the minimum training time generally appears when the number of workers in a group is moderate (1-4). This region cannot be reached when only data, model, or hybrid parallelism is considered, so group hybrid parallelism finds a better solution than any of them. Interestingly, the large-scale models (VGGNet19, RESNet152) showed a large drop in training time when the number of workers approached 4. Because the proposed GHP scheme balances the computation, communication, and synchronization times, it is especially effective for large-scale models: the synchronization time is greatly reduced by dividing the model parameters into submodels. Moreover, when the number of workers in a group is a multiple or divisor of 4, the workers in a node can be fully utilized because each node has 4 workers. As depicted in Fig. 6, the training time tends to drop when the number of workers is 1, 2, 4, or 8.
In conclusion, the solution space of group hybrid parallelism includes those of data, model, and hybrid parallelism. Moreover, an efficient hardware configuration must be determined in order to benefit from group hybrid parallelism.

3) WEAK SCALING PERFORMANCE EVALUATION
Table 2 shows the weak-scaling performance of the proposed scheme. We measured the throughput when group hybrid parallelism is applied and compared it with that of a synchronous data parallelism scheme (Sync-DP), an empirical model division scheme (OWT), and a pipeline parallelism scheme (GPipe) by simulation. We increased the number of workers from 1 to 1,024 and the batch size from 32 to 32,768 and measured the speedup. When running simulations at large worker counts, we applied the ring-allreduce algorithm for synchronization and set the node-to-node bandwidth to 10 Gbps. The blank cells of the table indicate infeasible configurations due to memory limitations or the inability to partition the model into that many stages. We now analyze our observations with respect to the network model, batch size, and network bandwidth.
First, in terms of throughput, our proposed scheme outperforms the other schemes in most cases. As mentioned in Section V-B, the solution space of group hybrid parallelism includes those of data, model, and hybrid parallelism; thus, group hybrid parallelism achieves better performance than the other schemes.
However, because group hybrid parallelism does not consider pipelining at all, GPipe has a different solution space from GHP, and there are cases in which GPipe performs better. From Table 2, we made two observations. (1) GPipe performs much better on VGGNet: VGGNet contains many convolution operations with similar computation loads, so when the model is divided into partitions, the throughput imbalance among them is small. (2) The throughput gap shrinks as the batch size increases: as the batch size and the number of workers increase, the gap between GHP and GPipe becomes smaller, because GHP limits the activation traffic by grouping workers. Moreover, the performance of GPipe is strongly affected by the number of microbatches. When the number of microbatches increases, the bubble overhead becomes small enough to ignore, so throughput increases; however, memory consumption also increases, because the activations must be kept in memory until all microbatches have been trained sequentially, so there is a tradeoff.
Second, our proposed scheme is scalable. As seen in Table 2, GPipe cannot be applied to a large number of workers because it is hard to partition the model into that many stages: to apply pipeline parallelism to K workers, the training model must be divided into K stages, and it is hard to partition and balance the stages when the number of workers exceeds 8. Hybrid parallelism schemes such as OWT also showed low scalability because the amount of data exchanged in their model-parallel part is proportional to the batch size; we observed that the communication time in OWT increased drastically as the number of workers grew. In contrast, the amount of data exchanged in group hybrid parallelism is fixed and independent of the batch size. This observation underlines the importance of the model-division scheme and the advantage of group hybrid parallelism: a naive model-division scheme may perform worse than data parallelism, so an optimized model-division scheme should be applied.
Furthermore, we found that the structure of the training model and its parameter size affect performance. As seen in Table 2, OWT performs better than Sync-DP on the VGGNet models and worse on the RESNet models. This means the model structure must be considered when determining the optimal parallelism strategy; group hybrid parallelism does so in its model partitioning process and thus achieves the best performance regardless of the training model. Because the model parameter size is directly related to the synchronization delay, it also affects performance. When the parameter size is small, the gain from model partitioning is small, so the benefit of the proposed scheme is limited: for models with small parameter sizes, such as RESNet50, the proposed scheme and Sync-DP performed similarly. For models with large parameter sizes, such as VGGNet19, the proposed scheme outperformed Sync-DP. This means that model partitioning is effective when the model size is large.
To sum up, we observed that our scheme greatly reduces the training time by utilizing group hybrid parallelism. We compared our scheme with data parallelism (Sync-DP), hybrid parallelism (OWT), and pipeline parallelism (GPipe) and found that it outperforms them in terms of throughput and speedup. We also observed that our scheme is especially useful for models with a large number of parameters.

VI. CONCLUSION
In this paper, we addressed the limitations of existing parallelism schemes for training large-scale DNN models. Existing schemes lack scalability due to long communication times and worker memory limitations, so we proposed a group hybrid parallelism scheme that enhances scalability while maintaining accuracy. By analyzing the training behavior, we presented a training time model for group hybrid parallelism and, based on it, proposed a worker allocation scheme that minimizes the training time. Our solution is scalable because grouping workers minimizes the communication time, and it resolves the worker memory limitation by partitioning the model parameters. Moreover, because we change neither the model parameter size nor the parameter staleness, accuracy is preserved.
We implemented a scheduler and a distributed trainer to evaluate the performance of the proposed schemes, and we conducted experiments at laboratory scale. First, we compared the training time estimated by the training time model with the real training time; the MAPE of the total training time was about 20%, so we concluded that our model represents the actual training time well. Second, we measured the minimum training time while varying the number of workers in a single group to show that group hybrid parallelism outperforms the existing parallelism schemes. In most cases, the training time reached its minimum when the number of workers in the group was neither too small (data parallelism) nor too large (model or hybrid parallelism). Finally, we conducted a weak scaling test comparing the throughput of the various parallelism schemes while increasing the batch size and the number of workers.