DeepNet: Scaling Transformers to 1,000 Layers

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sub-layers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Extensive experiments demonstrate that DeepNet has superior performance across various benchmarks, including machine translation, language modeling (i.e., BERT, GPT), and vision pre-training (i.e., BEiT). Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.


Introduction
Recent years have witnessed a trend towards large-scale Transformer (Vaswani et al., 2017) models. The capacity has substantially increased from millions of parameters (Devlin et al., 2019; Conneau et al., 2020) to billions (Radford et al., 2019; Brown et al., 2020; Huang et al., 2019; Raffel et al., 2020; Lepikhin et al., 2021; Rae et al., 2021; Lin et al., 2021; Smith et al., 2022), and even trillions (Du et al., 2021). Large-scale models yield state-of-the-art performance on a wide range of tasks, and show impressive abilities in few-shot and zero-shot learning. Despite an enormous number of parameters, their depths (as shown in Figure 1) are limited by the training instability of Transformers.
Nguyen and Salazar (2019) find that pre-norm residual connections (Pre-LN) improve the stability of Transformers compared with post-norm connections (Post-LN). However, the gradients of Pre-LN at bottom layers tend to be larger than at top layers (Shleifer et al., 2021), leading to a degradation in performance compared with Post-LN. In order to alleviate the above issue, there have been efforts on improving the optimization of deep Transformers by means of better initialization (Zhang et al., 2019a; b; Huang et al., 2020) or better architecture (Wang et al., 2019; Liu et al., 2020; Bachlechner et al., 2020; Shleifer et al., 2021). These approaches can stabilize a Transformer model with up to hundreds of layers. Yet, none of the previous methods has been successfully scaled to 1,000 layers.
Our aim is to improve the training stability of Transformers and scale the model depth by orders of magnitude. To this end, we study the cause of unstable optimization, finding that the exploding model update is responsible for the instability. Motivated by this observation, we introduce a new normalization function (DEEPNORM) at residual connections (He et al., 2016), which has a theoretical justification of bounding the model update by a constant. The proposed method is simple yet effective, with only a few lines of code changed. The approach improves the stability of Transformers so that we are able to scale model depth to more than 1,000 layers. Moreover, experimental results show that DEEPNORM combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN. The proposed method can be a preferred alternative for Transformers, not only for extremely deep (such as >1,000 layers) models, but also for existing large models. Notably, our 200-layer model with 3.2B parameters achieves a 5 BLEU improvement on a massively multilingual machine translation benchmark compared to the state-of-the-art model (Fan et al., 2021).

Figure 2: (a) Pseudocode for DEEPNORM. We take Xavier initialization (Glorot and Bengio, 2010) as an example, and it can be replaced with other standard initializations. Notice that α is a constant. (b) Parameters of DEEPNORM for different architectures (N-layer encoder, M-layer decoder).
As shown in Figure 2, it is simple to implement our method based on Transformers with Post-LN. Compared to Post-LN, DEEPNORM up-scales the residual connection before performing layer normalization. Besides, we down-scale the parameters during initialization. Notably, we only scale the weights of feed-forward networks, as well as the value projection and the output projection of attention layers. Moreover, the scales of residual connection and initialization are dependent on the architecture (Figure 2). We provide more details in Section 4.3.
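To make this concrete, the following PyTorch-style sketch shows one way the DEEPNORM residual and the scaled initialization could be written. It is a minimal illustration rather than the reference implementation: the constants correspond to the encoder-only setting derived in the appendix (α = (2N)^(1/4), β = (8N)^(-1/4)); the other architectures in Figure 2(b) use different constants, and the parameter names (fc1, v_proj, out_proj, ...) are assumed for illustration.

    import torch.nn as nn

    def deepnorm_constants(num_layers: int):
        # Encoder-only setting from the appendix derivation:
        # alpha = (2N)^(1/4), beta = (8N)^(-1/4).
        alpha = (2 * num_layers) ** 0.25
        beta = (8 * num_layers) ** -0.25
        return alpha, beta

    class DeepNormResidual(nn.Module):
        """x_{l+1} = LN(alpha * x_l + G_l(x_l)): up-scaled residual, then LayerNorm."""

        def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
            super().__init__()
            self.sublayer = sublayer  # attention or feed-forward network G_l
            self.alpha = alpha
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return self.norm(self.alpha * x + self.sublayer(x))

    def deepnorm_init(module: nn.Module, beta: float):
        # Down-scale only the FFN weights and the value/output projections of
        # attention; query/key projections keep plain Xavier (gain = 1).
        for name, param in module.named_parameters():
            if param.dim() < 2:
                continue
            if any(key in name for key in ("fc1", "fc2", "v_proj", "out_proj")):
                nn.init.xavier_normal_(param, gain=beta)
            elif any(key in name for key in ("q_proj", "k_proj")):
                nn.init.xavier_normal_(param, gain=1.0)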

Instability of Deep Transformer
We study the causes of the instability for deep Transformers. Our analysis begins with the observation that better initialization methods stabilize the training of Transformers. This has also been verified by previous work (Zhang et al., 2019a; Huang et al., 2020; Xu et al., 2021). Therefore, we study the training process of Post-LN with or without proper initialization. With better initialization, we downscale the weights of the l-th layer by k_l = N − l + 1, l ∈ [1, N] after performing Xavier initialization. For example, the output projection W^l_o of the FFN in the l-th layer is initialized as W^l_o ∼ N(0, 1/(k_l^2 d)), where d is the average of the input and output dimensions. We name this model Post-LN-init. Notice that, different from the prior work (Zhang et al., 2019a), we narrow the scale of the lower layers instead of the higher layers. We believe that it helps to separate the effect of the gradient scale from the model update. Besides, Post-LN-init has the same architecture as Post-LN, which eliminates the impact of the architecture.
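A minimal sketch of the Post-LN-init scheme described above, assuming that "downscale by k_l" means dividing the Xavier-initialized weight of the l-th layer by k_l = N − l + 1 (the 1-based layer indexing and the helper name are illustrative assumptions):

    import torch
    import torch.nn as nn

    def post_ln_init(weight: torch.Tensor, layer_idx: int, num_layers: int):
        """Xavier-initialize, then downscale the l-th layer's weight by k_l = N - l + 1.

        layer_idx is 1-based, so the bottom layer (l = 1) is scaled down the most,
        while the top layer (l = N) keeps its plain Xavier scale.
        """
        k_l = num_layers - layer_idx + 1
        nn.init.xavier_normal_(weight)
        with torch.no_grad():
            weight.div_(k_l)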
We train 18L-18L Post-LN and 18L-18L Post-LN-init on the IWSLT-14 De-En machine translation dataset. Figure 3 visualizes their gradients and validation loss curves. As shown in Figure 3(c), Post-LN-init converged while Post-LN did not. Post-LN-init has an even larger gradient norm in the last several layers, although its weights have been scaled down. Furthermore, we visualize the gradient norm of the last decoder layer with the model depth varying from 6L-6L to 24L-24L. Figure 3 shows that the gradient norm of Post-LN-init in the last layer is still much larger than that of Post-LN, regardless of model depth. This indicates that the exploding gradients in deep layers should not be the root cause of the instability of Post-LN, while the scale of the model update tends to account for it.
Then we demonstrate that the instability of Post-LN comes from a chain of several issues, including gradient vanishing as well as too large model updates. As shown in Figure 4(a), we first visualize the norm of the model update ||∆F|| at the early stage of training: ||∆F|| = ||F(x, θ_i) − F(x, θ_0)||, where x and θ_i denote the input and the model parameters after the i-th update. Post-LN has an exploding update at the very beginning of training, and then nearly no update shortly after. It indicates that the model has been stuck in a spurious local optimum. Both warm-up and better initialization help alleviate this issue, enabling the model to update smoothly. When the update explodes, the inputs to LN become large (see Figure 4(b) and Figure 4(c)). According to the theoretical analysis from Xiong et al. (2020), the magnitude of the gradient through LN is inversely proportional to the magnitude of its input: ||∂LN(x)/∂x|| = O(√d / ||x||). Figure 4(b) and Figure 4(c) show that ||x|| is significantly larger than √d (d = 512) without warm-up or proper initialization, which explains the gradient vanishing observed in the training of Post-LN (see Figure 4(d)). Above all, the instability starts from the large model update at the beginning of training. It renders the model trapped in a bad local optimum, which in turn increases the magnitude of the inputs to each LN. As training continues, the gradient through LN becomes increasingly small, thus resulting in severe gradient vanishing. The vanishing gradients make it difficult to escape from the local optimum, and further destabilize the optimization. On the contrary, Post-LN-init has relatively small updates, and the inputs to LN are stable. This relieves the problem of gradient vanishing, making optimization more stable.
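The inverse relationship between the LN gradient and its input magnitude can be checked numerically. The short sketch below is a rough illustration, not from the paper: it scales the input to a LayerNorm and projects the output onto a fixed random direction standing in for the rest of the network; the resulting gradient norm shrinks roughly like √d/||x||.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d = 512
    ln = nn.LayerNorm(d)
    v = torch.randn(d)   # fixed downstream direction
    x0 = torch.randn(d)  # fixed input direction

    for scale in (1.0, 4.0, 16.0, 64.0):
        x = (scale * x0).requires_grad_(True)
        (ln(x) * v).sum().backward()
        # As ||x|| grows, the gradient flowing through LN shrinks roughly as sqrt(d)/||x||.
        print(f"||x|| = {x.norm().item():8.1f}   ||grad|| = {x.grad.norm().item():.4f}")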

DEEPNET: Extremely Deep Transformers
In this section, we introduce our extremely deep Transformer named DEEPNET. It stabilizes the optimization by mitigating the exploding model update problem. We first provide the estimation of the expected magnitude of DEEPNET's model update. Then we provide the theoretical analysis to show that its updates can be bounded by a constant with our proposed DEEPNORM.

Architecture
DEEPNET is based on the Transformer architecture. Compared to the vanilla Transformer, it uses our new DEEPNORM, instead of Post-LN, for each sub-layer. The formulation of DEEPNORM can be written as x_{l+1} = LN(α x_l + G_l(x_l, θ_l)), where α is a constant, and G_l(x_l, θ_l) is the function of the l-th Transformer sub-layer (i.e., attention or feed-forward network) with parameters θ_l. Besides, DEEPNET scales the weights θ_l inside residual branches by β. Notably, both α and β are constants that only depend on the architecture, and we provide the derivation in Section 4.3.

Expected Magnitude of Model Update
Attention is an important part of the Transformer. Without loss of generality, we study the 1-head case. Let Q, K, V ∈ R^{n×d} denote the query, key, and value, respectively. W_Q, W_K, W_V ∈ R^{d×d_k} are the input projection matrices, and W_O ∈ R^{d_k×d} is the output projection matrix. Then, the attention module can be formulated as Attn(Q, K, V) = softmax(QW_Q(KW_K)^T / √d_k) · VW_V W_O. We study the magnitude of the attention module. Lemma 4.1 proves that W_Q and W_K do not change the bound of the attention output's magnitude.

Lemma 4.1. Attn(Q, K, V) Θ= VW_V W_O, where Θ= stands for equal bound of magnitude.
In other words, the magnitude of the attention output only depends on the value and output projections: Attn(Q, K, V) Θ= VW_V W_O. In this work, we only consider the magnitude of the model update, so it is sufficiently instructive to study the case where the hidden dimension equals 1. For simplicity, we reduce the matrices W_V, W_O to the scalars v, w, which means Attn(Q, K, V) Θ= vwV. Similarly, we have FFN(X) Θ= vwX, where v, w denote the parameters of the feed-forward network.
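Lemma 4.1 is easy to sanity-check numerically: because each softmax row sums to one, the attention output is a convex combination of the rows of VW_V W_O, so rescaling W_Q and W_K changes the attention weights but not the scale of the output. A rough numpy check (an illustration, not part of the paper's derivation):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 16, 64
    Q, K, V = rng.standard_normal((3, n, d))
    W_V, W_O = rng.standard_normal((2, d, d)) / np.sqrt(d)

    def softmax(a):
        a = a - a.max(axis=-1, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=-1, keepdims=True)

    for s in (0.1, 1.0, 10.0):
        W_Q = s * rng.standard_normal((d, d)) / np.sqrt(d)
        W_K = s * rng.standard_normal((d, d)) / np.sqrt(d)
        attn = softmax((Q @ W_Q) @ (K @ W_K).T / np.sqrt(d)) @ V @ W_V @ W_O
        # The output norm stays on the same order regardless of the W_Q / W_K scale s.
        print(f"scale {s:5.1f}  ->  ||Attn(Q, K, V)|| = {np.linalg.norm(attn):.2f}")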
We define the model update as ∆F = F(x, θ*) − F(x, θ), where θ* denotes the parameters after one update step. Based on the analysis above, we have the following theorem to characterize the magnitude of ||∆F|| for an N-layer DEEPNET with N attention and N FFN sub-layers.

Theorem 4.2. Given an N-layer DEEPNET F(x, θ) (θ = {θ_1, θ_2, ..., θ_{2N}}), where θ_{2l−1} and θ_{2l} denote the parameters of self-attention and FFN in the l-th layer, and each sub-layer is normalized with DEEPNORM x_{l+1} = LN(α x_l + G_l(x_l, θ_l)), the magnitude of the model update ||∆F|| is bounded as given in Appendix A.

Vanilla Post-LN can be regarded as a special case of DEEPNET, where α = 1 and v_l = w_l = 1 at Xavier initialization (Glorot and Bengio, 2010). Based on Theorem 4.2, vanilla Post-LN tends to accumulate the update of each sub-layer, which leads to an exploding magnitude of the model update and destabilizes the optimization at the early stage. This explains our findings in Section 3.
Besides, Theorem 4.2 also explains why warm-ups and smaller initialization can stabilize the training of Post-LN. Warm-ups can reduce the magnitude of the model update by decreasing ||θ*_i − θ_i||, while smaller initialization lowers v_i^2 + w_i^2. Furthermore, we study the magnitude of DEEPNET with an N-layer encoder and an M-layer decoder. Let F_ed(x, y, θ_e, θ_d) denote the model, where x, y are the inputs of the encoder and the decoder. θ_e follows the same definition as θ in Theorem 4.2. θ_d = {θ_{d1}, θ_{d2}, ..., θ_{d,3M}} stands for the parameters of self-attentions, cross-attentions, and FFNs. We use {α_e, G_el} and {α_d, G_dl} to distinguish the notations between the encoder and the decoder. The expected magnitude of the encoder-decoder's model update is characterized by Theorem A.3 in the appendix: given an encoder-decoder DEEPNET F_ed(x, y, θ_e, θ_d) with N encoder layers and M decoder layers, each encoder sub-layer is normalized as x_{l+1} = LN(α_e x_l + G_el(x_l, θ_el)), and each decoder sub-layer is normalized as y_{l+1} = LN(α_d y_l + G_dl(y_l, θ_dl)). The vanilla encoder-decoder model satisfies that all of {α_e, α_d, v_ei, w_ei, v_di, w_di} equal 1, which indicates a similar accumulative effect that leads to fast growth of the magnitude with respect to the model depth (see Figure 5). Furthermore, the cross-attention propagates the magnitude from the encoder to the decoder, which explains why the decoder is more unstable than the encoder (Liu et al., 2020).

Derivation for DEEPNORM and the Initialization
We show that the expected model updates for DEEPNET can be bounded by a constant with proper parameters α and β. Our analysis is based on the SGD update, and we empirically verify that it works well for the Adam optimizer (Kingma and Ba, 2015). We provide the analysis on the encoder-decoder architecture, which can be naturally extended to encoder-only and decoder-only models in the same way. Analogous to Zhang et al. (2019b), we set our goal for the model update as follows:

GOAL: F_ed(x, y, θ_e, θ_d) is updated by Θ(η) per SGD step after initialization as η → 0. That is, ||∆F_ed|| = Θ(η), where ∆F_ed = F_ed(x, y, θ_e − η ∂L/∂θ_e, θ_d − η ∂L/∂θ_d) − F_ed(x, y, θ_e, θ_d).

There are multiple schemes to bound Equation (2) by Θ(η). In order to balance the effect of the residual connections and the initialization, we set α_e = 0.81(N^4 M)^{1/16}, β_e = 0.87(N^4 M)^{-1/16} for the encoder and α_d = (3M)^{1/4}, β_d = (12M)^{-1/4} for the decoder (see Figure 2(b)).

In comparison with Post-LN, we visualize the model updates for DEEPNET on the IWSLT-14 De-En translation dataset at the early training stage. Figure 5 shows that the model update of DEEPNET is nearly constant, while the model update of Post-LN is exploding.
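For reference, a small helper collecting the (α, β) pairs this derivation leads to. The encoder-only values follow the derivation in Appendix C; the decoder-only and encoder-decoder values are the constants listed in Figure 2(b) of the paper, copied here as assumptions rather than re-derived:

    def deepnorm_params(arch: str, N: int = 0, M: int = 0):
        """Return {'encoder': (alpha, beta), 'decoder': (alpha, beta)} for DEEPNORM."""
        if arch == "encoder-only":
            return {"encoder": ((2 * N) ** 0.25, (8 * N) ** -0.25)}
        if arch == "decoder-only":
            return {"decoder": ((2 * M) ** 0.25, (8 * M) ** -0.25)}
        if arch == "encoder-decoder":
            return {
                "encoder": (0.81 * (N ** 4 * M) ** (1 / 16),
                            0.87 * (N ** 4 * M) ** (-1 / 16)),
                "decoder": ((3 * M) ** 0.25, (12 * M) ** -0.25),
            }
        raise ValueError(f"unknown architecture: {arch}")

    # For example, the 100L-100L encoder-decoder model in Section 6:
    print(deepnorm_params("encoder-decoder", N=100, M=100))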

Neural Machine Translation
We use BLEU as the evaluation metric for all experiments. Table 1 reports the results of the baselines and DEEPNET on the WMT-17 En-De translation dataset. According to their LNs, the baselines are grouped into three categories: Pre-LN, Post-LN, and No-LN. All the compared models are base-size with different depths.
Compared with the models with Post-LN, DEEPNET is more stable, and can successfully scale to 100L-100L, reaching 28.9 BLEU on the test set. In contrast, the baselines with Post-LN lead to unstable optimization when the depth goes to 50L-50L. Besides, DEEPNET achieves comparable performance with these baselines when the models are shallow.
In addition, we compare DEEPNET with the methods without LN. Both R-Fixup and T-Fixup introduce better initialization methods, which stabilize the training of No-LN Transformers with up to 50L-50L layers. Yet, their performance is not as good as those with Post-LN. Besides, half-precision could destabilize the training of ReZero, leading to its divergence with 18L-18L layers. This observation is also reported by Liu et al. (2020). Moreover, deeper models (50L-50L) do not outperform the shallow models (18L-18L). In comparison, DEEPNET achieves better translation accuracy than these methods, and scaling to deeper models brings no harm to the performance.
Compared with the Post-LN baselines, the models with Pre-LN are more stable. Both vanilla Pre-LN and DLCL can be scaled to 100L-100L, and the 50L-50L NormFormer is also trained successfully. Nevertheless, Pre-LN leads to a 0.5-1.0 BLEU drop compared with the converged Post-LN models. We presume this is caused by the problem that the gradients of Pre-LN at earlier layers tend to be larger than the gradients at later layers (Shleifer et al., 2021). We leave it as future work. In contrast, DEEPNET alleviates the problem by using Post-LN, and outperforms all the Pre-LN baselines.
Convergence with varying depth. We vary the depths of the models from 10L-10L to 100L-100L with an interval of 10 layers. All experiments are conducted with mixed precision training, except ReZero. Figure 6 shows the results on the IWSLT-14 dataset. We train the models for 8,000 steps because we find that most divergence occurs at the beginning of optimization. Overall, DEEPNET is stable from shallow to deep. It converges fast, achieving over 30 BLEU in only 8,000 steps, while most of the baselines do not. Moreover, the performance keeps improving as the model goes deeper.
Large learning rate, batch size, and hidden dimension. We further scale DEEPNET to a larger learning rate, batch size, and hidden dimension, respectively. For each experiment, we only change one hyperparameter with the others fixed. Figure 7 reports the loss curves on the WMT-17 validation set. It shows that DEEPNET can be trained without difficulty in all the largest settings. The loss of DEEPNET with 1024 hidden size increases after 10K steps because of overfitting. Besides, it indicates that DEEPNET can benefit from the larger settings, resulting in faster convergence and lower validation loss.

Massively Multilingual Neural Machine Translation
We conduct experiments on large-scale multilingual machine translation, which is a good testbed for large models. We first use the OPUS-100 corpus (Zhang et al., 2020) to evaluate our model. OPUS-100 is an English-centric multilingual corpus covering 100 languages, which is randomly sampled from the OPUS collection. We scale DEEPNET up to 1,000 layers. The model has a 500-layer encoder, a 500-layer decoder, 512 hidden size, 8 attention heads, and a 2,048-dimensional feed-forward layer. More details can be found in the Appendix.
Table 2 summarizes the results of DEEPNET and the baselines. It shows that increasing the depth can significantly improve the translation quality of NMT: the 48-layer baseline achieves a gain of 3.2 points on average over the 12-layer model. DEEPNET can successfully scale the depth up to 1,000 layers, outperforming the baseline by 4.4 BLEU. It is noted that DEEPNET is trained for only 4 epochs, and the performance can be further improved given more computation budget.
Scaling law in terms of depth. We train DEEPNET with {12, 20, 100, 200, 1000} layers on the OPUS-100 dataset. Figure 8 illustrates the scaling curve. Compared with bilingual NMT, multilingual NMT benefits more from scaling the depth of the model because of its hunger for model capacity. We observe logarithmic growth of the BLEU score for multilingual NMT, and the scaling law can be written as BLEU(d) = A log(d) + B, where d is the depth, and A, B are constants determined by the other hyper-parameters.
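As a quick sanity check of this functional form, A and B can be fitted by least squares on log-depth. The sketch below uses placeholder BLEU values purely to illustrate the fitting procedure; they are not results from the paper.

    import numpy as np

    depths = np.array([12, 20, 100, 200, 1000])
    # Hypothetical BLEU scores, only to illustrate the fitting procedure.
    bleu = np.array([27.0, 28.5, 31.0, 32.0, 34.5])

    A, B = np.polyfit(np.log(depths), bleu, deg=1)
    print(f"BLEU(d) ~ {A:.2f} * log(d) + {B:.2f}")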
More data and language directions. To explore the limits of DEEPNET on multilingual NMT, we then scale up the training data by using CCMatrix (Schwenk et al., 2021). We also expand the data from CCAligned (El-Kishky et al., 2020), OPUS (Zhang et al., 2020), and Tatoeba to cover all languages of the Flores-101 evaluation sets. The final data consists of 102 languages, 1,932 directions, and 12B sentence pairs. With this data, we train DEEPNET with a 100-layer encoder, a 100-layer decoder, 1,024 hidden dimension, 16 heads, and 4,096 intermediate dimension of the feed-forward layers. More details can be found in the Appendix. We compare DEEPNET with the state-of-the-art multilingual NMT model M2M-100 (Fan et al., 2021). M2M-100 has a 24-layer encoder, a 24-layer decoder, and 4,096 hidden size, resulting in up to 12B parameters. Compared with M2M-100, DEEPNET is deep and narrow with only 3.2B parameters. For a fair comparison, we generate translations with beam size 5 and length penalty 1.
Following M2M-100 (Fan et al., 2021), we evaluate the models on several multilingual translation evaluation datasets, including WMT (Bojar et al., 2014; 2017; 2018; Barrault et al., 2019), OPUS (Zhang et al., 2020), TED (Qi et al., 2018), and Flores (Goyal et al., 2021). The language pairs from the WMT dataset are English-centric. There are 10 languages including English, and most of them are high-resource. For the OPUS dataset, we select the non-English directions from the test set, which has 30 evaluation pairs. The TED evaluation set has 28 languages and 756 directions, and the data is from the spoken language domain. The Flores dataset has all translation pairs between 102 languages. We use a subset covering the languages supported by both M2M-100 and DEEPNET, resulting in 87 languages and 7,482 translation directions.
We report the results in Table 3. For a fair comparison, we use the same evaluation methods as the baseline. The details can be found in the Appendix. It shows that DEEPNET has significantly better performance than M2M-100 on all evaluation datasets, indicating that deepening the model is a very promising direction to improve the quality of NMT models.

Conclusion and Future Work
We improve the stability of Transformers and successfully scale them to 1,000 layers. This is achieved by our DEEPNET with a novel normalization function called DEEPNORM. It has a theoretical justification to stabilize the optimization with a constant upper bound for model updates. Experimental results verify the effectiveness of our method across various benchmarks. We focus on machine translation as a test bed in the current experiments. In the future, we will extend DEEPNET to support more diverse tasks, e.g., language model pre-training (Dong et al., 2019; Bao et al., 2020; Chi et al., 2021a; Ma et al., 2021; Chi et al., 2021b), protein structure prediction (Jumper et al., 2021), and BEiT vision pre-training (Bao et al., 2022; Wang et al., 2021).

A Main Theorem Proof
Lemma 4.1 states that Attn(Q, K, V) Θ= VW_V W_O, where Θ= stands for equal bound of magnitude.

Proof. The weights W_Q and W_K only affect the attention scores inside the softmax, whose rows sum to one, so the bound of the attention output's magnitude is determined by VW_V W_O.

Theorem 4.2 considers an N-layer DEEPNET F(x, θ) (θ = {θ_1, θ_2, ..., θ_{2N}}), where θ_{2l−1} and θ_{2l} denote the parameters of self-attention and FFN in the l-th layer, and each sub-layer is normalized with DEEPNORM: x_{l+1} = LN(α x_l + G_l(x_l, θ_l)). Our aim is to study the magnitude of model updates. Following Zhang et al. (2019b), we make the following assumptions to simplify the derivations:

1. Hidden dimension d equals 1.

var(x
3. All relevant weights v, w are positive with magnitude less than 1, and α, β for DEEPNORM are positive with magnitude greater than 1.

Given Assumption 1, if G(x) is a feed-forward network with θ = {v, w}, then G(x) Θ= vwx. According to Lemma 4.1, the query and key projections do not change the bound of the attention output's magnitude. Therefore, if G(x) is self-attention with θ = {q, k, v, w}, then G(x) Θ= vwx. Especially, if Xavier initialization is used for the projections, then the output can preserve the input variance, which is equivalent to v = w = 1. With Assumption 2, we have:

With Equation (4), the magnitude of ∂f_l/∂x and ∂f_l/∂θ_l is bounded by:

Besides, the model update ||∆F|| satisfies:

Using Taylor expansion for Equation (6), we get:

Then, we have:

Theorem A.3. Given an encoder-decoder DEEPNET F_ed(x, y, θ_e, θ_d) with N encoder layers and M decoder layers, where each encoder sub-layer is normalized as x_{l+1} = LN(α_e x_l + G_el(x_l, θ_el)), and each decoder sub-layer is normalized as y_{l+1} = LN(α_d y_l + G_dl(y_l, θ_dl)).

Proof. The derivation of the self-attention and FFN layers is given in Appendix A.2. For the cross-attention layers, we have:

With Equation (11), we have the bound of the derivative of f_dl:

According to Theorem 4.2, we have:

Therefore, the magnitude of ||∆F_ed|| satisfies:

As a special case, the corresponding parameters in Equation (13) for vanilla Post-LN with standard initialization are all 1, so its model update accumulates across all sub-layers, as described in Section 4.2.

B Derivation for Encoder-Decoder Architecture
Here, we give the derivation of DEEPNET for the encoder-decoder architecture with an N-layer encoder and an M-layer decoder. As in Section 4.3, we have to bound the second term of Equation (13) to Θ(η). For the first term, we set v_ei = v_e, w_ei = w_e, so that it goes to:

C Derivation for Encoder-only (Decoder-only) Architecture
For an N-layer DEEPNET, starting from Theorem 4.2 we have:

By assumption, ||∂L/∂F|| = O(1), and:

Due to symmetry, we set v_i = v, w_j = w, so it goes to 2N(v^2 + w^2)/α^2 = 1. In this work, we use v = w = (8N)^{-1/4} and α = (2N)^{1/4}.
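A quick numerical check of this last step, under the stated choice of v, w, and α:

    # v = w = (8N)^(-1/4) and alpha = (2N)^(1/4) give 2N * (v^2 + w^2) / alpha^2 = 1.
    for N in (6, 18, 50, 100, 500):
        v = w = (8 * N) ** -0.25
        alpha = (2 * N) ** 0.25
        print(N, 2 * N * (v ** 2 + w ** 2) / alpha ** 2)  # -> 1.0 for every N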

D.5 Evaluation Details
For IWSLT-14 and WMT-17, we use the in-built BLEU scripts of Fairseq to report the scores. Besides, we report the case-sensitive detokenized BLEU using sacreBLEU (Post, 2018) for the results of OPUS-100. For WMT, OPUS, and TED, we use the same test sets and evaluation scripts as in M2M (Fan et al., 2021), and the results of M2M are directly from the paper (Fan et al., 2021). For the Flores-101 evaluation set, we report the spBLEU of M2M-12B with the public checkpoint and script.
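For instance, the detokenized BLEU for OPUS-100 could be computed with sacreBLEU's Python API along the following lines; this is a sketch, and the file names are placeholders:

    from sacrebleu.metrics import BLEU

    # Detokenized system outputs and references, one sentence per line.
    with open("hyp.detok.txt") as f_hyp, open("ref.detok.txt") as f_ref:
        hyps = [line.strip() for line in f_hyp]
        refs = [line.strip() for line in f_ref]

    bleu = BLEU()  # sacreBLEU defaults: 13a tokenizer, case-sensitive
    print(bleu.corpus_score(hyps, [refs]))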

Figure 3: (a) Gradient norm in the top layers of the 18L-18L models. (b) Gradient norm in the last layer of the models with depths varying from 6L-6L to 24L-24L. (c) Validation loss curves of the 18L-18L models.

Figure 4: Visualization of the 18L-18L models at the early stage of training: the norm of the model update, the norm of the inputs to LN, and the gradient norm (see Section 3).

Figure 5: Model updates of vanilla Post-LN and DEEPNET at the early stage of training. The visualization is conducted on 64-128-2 tiny Transformers with depth varying from 6L-6L to 100L-100L. It shows that DEEPNET has much smaller and more stable updates than Post-LN.

Figure 6: BLEU scores on the IWSLT-14 De-En test set for different deep models with depth varying from 10L-10L to 100L-100L.

Figure 8: Average BLEU scores for DEEPNET with varying depth on the OPUS-100 En-X and X-En test sets.

Figure 10: Evaluation results of the 3.2B DEEPNET on a subset of the FLORES-101 devtest set. The i-th row is the source language, while the j-th column is the target language. There are 87 languages and 7,482 directions.

Table 2: Average BLEU for DEEPNET and the baseline on the OPUS-100 test sets.

Table 3: BLEU scores for DEEPNET and M2M-100 on various evaluation sets.

Table 4: Hyperparameters for the machine translation experiments on the IWSLT-14 De-En dataset.

Table 5: Hyperparameters for the base-setting experiments on the WMT-17 En-De dataset.

Table 6: Hyperparameters for the large-setting experiments on the WMT-17 En-De dataset.

Table 7: Hyperparameters for the machine translation experiments on the OPUS-100 dataset.

Table 8: Hyperparameters for the machine translation experiments on the 102-language dataset.
Figure 9: Evaluation results of the 12B M2M-100 on a subset of the FLORES-101 devtest set. The i-th row is the source language, while the j-th column is the target language. There are 87 languages and 7,482 directions.
E Experimental Results in Section 6