LongT5-Mulla: LongT5 With Multi-Level Local Attention for a Longer Sequence

Efficient Transformer models typically employ local and global attention methods, or hierarchical or recurrent architectures, to process long text inputs in natural language processing tasks. However, these models sacrifice either efficiency, accuracy, or compatibility when extended to longer sequences. To retain both the accuracy of global attention and the efficiency of local attention, while remaining compatible enough to be applied directly to an existing pre-trained model, in this paper we propose multi-level local attention (Mulla attention), a hierarchical local attention that acts simultaneously on the input sequence and on multiple pooling sequences of different granularity, thus performing long-range modeling while maintaining linear or log-linear complexity. We apply Mulla attention to LongT5 and implement our LongT5-Mulla sequence-to-sequence model, without introducing new parameters except for positional embeddings. Experiments show that our model surpasses all baseline models, including two original variants of LongT5, on the 8~16k-input long text summarization task on the Multi-News, arXiv and WCEP-10 datasets, with improvements of at least +0.22, +0.01, and +0.52 percentage points (pp) in averaged Rouge score respectively, while at the same time effectively processing longer sequences of 16~48k tokens with at least 52.6% lower memory consumption than LongT5-tglobal and +0.56~1.62 pp higher averaged Rouge scores than LongT5-local. These results demonstrate that our proposed LongT5-Mulla model can effectively process long sequences and extend the maximum input length for long text tasks from 16k to 48k while maintaining accuracy and efficiency.


I. INTRODUCTION
Transformer [1] models have achieved dominant results in numerous text-oriented tasks and have become the foundational architecture in natural language processing (NLP). However, standard Transformer models [1], [2], [3], [4], [5] with full attention cannot efficiently handle long texts due to their quadratic complexity with respect to the input length, and are consequently limited to maximum input lengths of 1∼4k to keep computational consumption manageable. To relieve this limitation, considerable research on so-called efficient Transformers [6] has been done in the past few years, including Transformers with sparse attention and Transformers with hierarchical or recurrent architectures.
As the main category of efficient Transformers, Transformers with sparse attention share a key feature: their attention matrices are sparse, which means that only a small number of selected tokens, instead of all tokens in the input sequence, are attended to during the attention computation. Depending on the way of sparsification, it is called local attention [7] if the neighbor tokens of the query token are selected as the attended tokens, and global attention [8] if shared special tokens are selected. With the help of local and global attention, alone or jointly, efficient Transformer models such as LED [7], BigBird [9], and LongT5 [10] have achieved state-of-the-art results in long text understanding and generative NLP tasks with relatively low resource consumption, and have extended the length limit from 4k to 16k. Unfortunately, on the one hand, models with global attention fail to process longer documents under mainstream hardware conditions due to their essentially quadratic computational complexity, which results in a lack of efficiency. On the other hand, models with only local attention can process much longer documents but cannot effectively capture long-range dependencies, which results in reduced accuracy.
As another category, Transformers with hierarchical or recurrent architectures [11], [12], [13], [14], [15] are based on the concept of divide-and-conquer. They follow a typical procedure that involves splitting the input sequence into multiple parts, processing them with full attention one by one, and then summarizing the results from each part or from historical memory. These efforts have achieved competitive results compared to other efficient Transformers, with both efficiency and accuracy. However, they significantly alter the architecture of Transformers and cannot be easily applied to existing pre-trained Transformers without retraining the entire model from scratch, which demonstrates their relatively low compatibility.
Summarizing the previous works mentioned above, it is evident that current efficient Transformers sacrifice either efficiency, accuracy, or compatibility, and these challenges hinder their application to longer sequences. Therefore, exploring how to extend the length limit of efficient Transformers while maintaining a good balance among these three factors is still necessary.
To address this research gap, in this paper we propose a new sparse attention mechanism called multi-level local attention (Mulla attention), which has linear or log-linear attention complexity and gives efficient Transformer models better performance on long-text NLP tasks. To achieve this, Mulla attention fuses local attention and global attention into a unified but hierarchical local attention that acts on the input sequence and on multiple pooling sequences of different granularity, introduces no extra parameters except for positional embedding weights of negligible size, and can be applied directly to existing pre-trained Transformer models. To construct an efficient Transformer model with Mulla attention, we start from the pre-trained LongT5 [10] sequence-to-sequence (seq2seq) model and replace the original attention modules in the Encoders with Mulla attention, which is possible thanks to the compatibility of sparse attention, and thus obtain our LongT5-Mulla model.
We conduct experiments in the 8∼16k-input long text summarization task, and the results show that our LongT5-Mulla model exceeds all baseline models, including two original variants of LongT5, on the Multi-News [16], arXiv [17], and WCEP-10 [18] datasets, which demonstrates the great ability of Mulla attention to process long sequences.
To further analyze the properties of the LongT5-Mulla model on longer sequences with lengths from 16k to 48k, we first conduct experiments on the memory consumption and inference speed of LongT5 models with different attention modules, and then study how the performance of these models varies with the input length at inference time. These experiments validate the advantages of our model in processing longer sequences while maintaining a balance among efficiency, accuracy, and compatibility.
The main contributions of our work are:
• We propose a new attention mechanism, Multi-level Local Attention (Mulla attention), which achieves linear or log-linear attention complexity based on a hierarchical sparse attention mechanism.
• We present a new Transformer seq2seq model, LongT5-Mulla, which is based on Mulla attention and achieves state-of-the-art results in the 8∼16k-input text summarization task on three common long text datasets.
• We conduct further studies on the properties of Mulla attention on longer sequences with lengths of 16∼48k, including efficiency and performance variation, and verify its ability to effectively process longer sequences.

II. RELATED WORK
A. TRANSFORMERS WITH SPARSE ATTENTION
The standard Transformer [1] adopts the full attention mechanism, in which attention is computed between each token in the input sequence and all other tokens, causing O(N²) complexity and consuming huge resources when processing long sequences. For this reason, models built on the vanilla Transformer, for example BERT [2], BART [3], GPT2 [5], and T5 [4], generally process no more than 1∼4k input tokens at a time.
To deal with this problem, a lot of research has been done to make the attention mechanism more efficient and lower its complexity, leading to so-called efficient Transformers [6]. Apart from low-rank and kernel optimizations such as Linformer [19] and Performer [20], and down-sampling models such as Charformer [21] and BP Transformer [22], a major branch of this research concerns Transformers with sparse attention, which directly delete some attention relations from the original attention computation and make the attention matrix sparse.
Among these contributions to Transformers with sparse attention, LED [7] and ETC [8] concurrently propose sliding-window local attention and a small-scale variant of full attention called global attention, which makes attention focus only on neighboring and selected tokens. BigBird [9] jointly uses local, global, and random attention to cover as much useful information as possible over short and long distances, and achieves state-of-the-art results on many question answering and summarization tasks. LongT5 [10] proposes a modified local-global attention called transient global attention that dynamically constructs global tokens and discards them after the attention operation. However, a major challenge for these works is how to maintain efficiency and accuracy while scaling up the input length: in scenarios with longer sequences of 16∼48k tokens, they are either computationally intensive due to a linearly increasing global memory (e.g., global attention) or informationally lossy because of a fixed pattern that only fits common long texts of 8∼16k (e.g., local attention and random attention).
To tackle this problem, we use the simplest yet highly promising type of sparse attention [23], local attention, as the foundation of our design, and incorporate a lightweight and self-adaptive hierarchical structure to enhance its ability to capture long-range dependencies while largely preserving its efficiency on longer sequences.

B. LONG-TEXT MODELING
Many works show that it is not necessary to keep the standard Transformer architecture to process long text sequences; instead, efficient long text models can be built with recurrent or hierarchical architectures.
As one research route, Transformer-XL [14] and Compressive Transformer [15] employ recurrent architectures derived from basic Transformer modules that process text pieces block-wise and retain historical information through a memory mechanism, which keeps the attention range of each computation relatively narrow.
As another research route, DANCER [11], Hi-Transformer [13], Summ^N [12], and SLED [24] design different hierarchical architectures that similarly process small parts of the input text block-wise; their main difference from recurrent patterns is that they use high-level Transformers to summarize the results of low-level Transformers in an end-to-end or multi-stage way. These models can handle common long texts but may be limited by the complexity of the high-level Transformers.
In addition, some works even explore methods without Transformers. For example, S4 [25], built on the structured state space model and HiPPO theory [26], can be trained in parallel like a convolutional neural network and run inference sequentially like a recurrent neural network, and it can process long sequences as effectively as Transformer-based models.
In summary, these works explore the possibility of designing an efficient architecture instead of improving the attention method itself. However, an alternative architecture usually requires additional modules with heavy engineering at both the hardware and software levels, which makes it challenging to transfer the architecture to deep learning libraries for downstream applications. Besides, it is also impossible to initialize such a model from an existing pre-trained large language model due to their completely different architectures, which highlights the lack of compatibility.
In contrast, we draw inspiration from hierarchical models and introduce Mulla attention, a new variant of local attention with an internal hierarchical structure, which can be easily applied to existing pre-trained efficient Transformers that already have local attention modules, with only small-scale continued pre-training or direct fine-tuning. Thanks to this compatibility, we can apply it to LongT5 and construct our LongT5-Mulla model.

III. METHODS
In this section, we introduce the methodology of Mulla attention and the LongT5-Mulla model.

A. PRELIMINARY: LOCAL ATTENTION
Local attention (also known as sliding-window attention) [7] is a sparse and lightweight form of attention. Unlike full attention, where each token attends to all other tokens in the sequence, local attention only considers the immediate left and right neighbors of each token within a fixed range. That is, attention queries are generated from all tokens, but the attention keys and values are selected from the neighboring tokens of each query token.
Local attention is a linear-complexity operation because the number of local neighbors (determined by the local radius) remains constant. However, the consequence of this trade-off is that local attention cannot capture long-range information.
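To make this concrete, the sketch below is a minimal, mask-based rendering of sliding-window local attention; it is written for readability, so it materializes a full N × N mask rather than the linear-memory blocked form described in Appendix B, and the function and weight names are ours for illustration.

```python
import torch
import torch.nn.functional as F

def local_attention(x, w_q, w_k, w_v, radius):
    # x: (batch, N, H); w_q/w_k/w_v: (H, H) projection weights
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    n = x.size(1)
    pos = torch.arange(n, device=x.device)
    # token i may only attend to tokens j with |i - j| <= radius
    mask = (pos[None, :] - pos[:, None]).abs() <= radius     # (N, N) boolean band
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # (batch, N, N)
    scores = scores.masked_fill(~mask, float("-inf"))        # block everything outside the window
    return F.softmax(scores, dim=-1) @ v                     # (batch, N, H)
```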

B. MULTI-LEVEL LOCAL ATTENTION
To expand the scope of local attention while avoiding global attention, which maintains long-range dependencies by linearly selecting global tokens, we propose multi-level local attention (Mulla attention), a hierarchical local attention that can perform both short-range and long-range modeling.
The main procedure of Mulla attention can be summarized in two steps: the pooling step and the attention step. In the pooling step, the input sequence is pooled at exponentially increasing rates, and the pooling tokens of each token are marked layer by layer. In the attention step, multiple local attentions centered on the input tokens and pooling tokens are performed simultaneously on the input and pooling sequences with the same local radius.
Figure 1 shows the comparison between local attention and Mulla attention in terms of attention matrices and mechanisms. As shown in the figure, local attention only acts on the area centered on the query tokens, with a high ratio of missing attention area, while Mulla attention is a combination of multiple local attentions with gradually compressed attention areas, going from a high to a low ratio of missing attention area. Because the compressed attention areas are generated from multi-level pooling tokens that contain information from different distances, the multiple local attentions performed on the pooling sequences are much more efficient than those performed on the original input sequence. In other words, Mulla attention allows the model to capture dependencies between query tokens and key-value tokens from nearby to distant ones, with a resolution that goes from high to low, as opposed to local attention, which only focuses on nearby key-value tokens.

FIGURE 1. Attention matrices and mechanisms of local attention (left) and Mulla attention (right). Suppose the length of the input sequence is 8, the local radius is 1, the pooling rate is 2, and the layer number of Mulla attention is 3. Local attention performs a single-layer attention in which each query token only attends to its neighbor tokens, while Mulla attention constructs two pooling layers to simultaneously perform three local attentions on three different sequences. Although all the layers of Mulla attention have the same local radius, the upper pooling layer can provide a higher attention area ratio through a shorter pooling sequence.

Now we give a formal and mathematical definition of Mulla attention. We assume that there is an input token sequence x with a length of N and a hidden size of H, denoted as

$$x = (x_1, x_2, \dots, x_N), \qquad (1)$$

where each token x_i is embedded as an H-dimensional float vector; we regard it as the sequence of the first layer and can also denote it as x^{l=1}. For Mulla attention, we define the following parameters: the layer number L, the local radius r, and the pooling rate K. The layer number determines the number of layers (including the input layer) that participate in the attention operation simultaneously. The local radius determines the range of neighbor tokens that should be attended to for each query token. And the pooling rate defines how many tokens from the lower layer are pooled to form a single token in the upper layer during the pooling step.
First, in the pooling step of Mulla attention, the input token sequence is pooled by averaging every K, K², …, K^{L−1} tokens for the different pooling layers, resulting in L − 1 pooling sequences x^{l=2}, …, x^{l=L}. Equation (2) shows how each pooling sequence can be computed recursively, starting from the input sequence with l = 1 in the lowest layer:

$$x^{l=j}_i = \frac{1}{K} \sum_{m=(i-1)K+1}^{iK} x^{l=j-1}_m, \qquad i = 1, \dots, \left\lceil \frac{N}{K^{j-1}} \right\rceil, \quad j = 2, \dots, L, \qquad (2)$$

where j is the current layer number, ⌈N/K^{j−1}⌉ is the length of the sequence in layer j, and each token x^{l=j}_i in the pooling sequence is computed by averaging K specific tokens from the sequence in the lower layer.
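The pooling step can be sketched as repeated average pooling of the previous layer, as below; this is a readability-first sketch in which the tail of each layer is zero-padded to a multiple of K before pooling, a simplification of the padding-token handling described in the text.

```python
import torch
import torch.nn.functional as F

def build_pooling_layers(x, num_layers, pool_rate):
    # x: (batch, N, H) input sequence = layer 1
    layers = [x]
    for _ in range(num_layers - 1):
        prev = layers[-1]
        pad = (-prev.size(1)) % pool_rate                  # pad length up to a multiple of K
        prev = F.pad(prev, (0, 0, 0, pad))                 # zero-pad along the sequence dimension
        pooled = prev.view(prev.size(0), -1, pool_rate, prev.size(-1)).mean(dim=2)
        layers.append(pooled)                              # length ~ ceil(N / K^(j-1))
    return layers
```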
Second, in the attention step of Mulla attention, multiple local attentions are performed simultaneously by constructing the attention queries from the input sequence x^{l=1}, while the attention keys and values are obtained from neighbors in both the input sequence x^{l=1} and the pooling sequences x^{l=2}, …, x^{l=L}. Similar to local attention, it is easy to define the neighbors from the same layer j of each token x^{l=j}_i with a local radius r, as shown in (3). To determine the neighbors from different layers of each token x^{l=1}_i in the input sequence, we define the tokens that recursively pool x^{l=1}_i during the pooling step as its proxies, which can be expressed by (4). Finally, we combine the neighbors from the same layer with the neighbors of the proxies from the different layers to obtain all of the neighbors of the input token x^{l=1}_i, which yields (5). Note that we use the padding token as a replacement to deal with out-of-boundary neighbors.

$$\mathrm{layerNeighbor}(x^{l=j}_i) = \{\, x^{l=j}_{i-r}, \dots, x^{l=j}_{i+r} \,\}, \qquad (3)$$

$$\mathrm{proxy}_j(x^{l=1}_i) = x^{l=j}_{\lceil i / K^{j-1} \rceil}, \qquad j = 2, \dots, L, \qquad (4)$$

$$\mathrm{neighbor}(x^{l=1}_i) = \mathrm{layerNeighbor}(x^{l=1}_i) \,\cup\, \bigcup_{j=2}^{L} \mathrm{layerNeighbor}\big(\mathrm{proxy}_j(x^{l=1}_i)\big). \qquad (5)$$
After completing these two steps, we construct the query, key, and value matrices by projecting the corresponding tokens with the linear projections Q, K, and V. We then generate the output sequence by weighted averaging of the value tokens according to the attention scores, as in a typical full attention mechanism [1].
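Putting the two steps together, the following sketch gathers, for every input token, its own local window plus the local windows around its proxies in each pooling layer, and attends over that combined key/value set. It reuses build_pooling_layers from the pooling sketch above, assumes a single attention head, and clips windows at sequence boundaries instead of inserting padding tokens, so it illustrates the mechanism rather than the efficient grouped implementation of Appendix B.

```python
import torch
import torch.nn.functional as F

def mulla_attention(x, w_q, w_k, w_v, num_layers, pool_rate, radius):
    batch, n, h = x.shape
    layers = build_pooling_layers(x, num_layers, pool_rate)  # pooling step (layer 0 = input)
    q = x @ w_q                                               # queries come from layer 1 only
    outputs = []
    for i in range(n):
        kv_tokens = []
        for j, layer in enumerate(layers):                    # j = 0 is the input layer
            idx = i // (pool_rate ** j)                       # proxy of token i in this layer
            lo, hi = max(0, idx - radius), min(layer.size(1), idx + radius + 1)
            kv_tokens.append(layer[:, lo:hi])                 # neighbours of the proxy
        kv = torch.cat(kv_tokens, dim=1)                      # combined neighbour set, Eq. (5)
        k, v = kv @ w_k, kv @ w_v
        scores = (q[:, i:i + 1] @ k.transpose(-2, -1)) / (h ** 0.5)
        outputs.append(F.softmax(scores, dim=-1) @ v)         # (batch, 1, H)
    return torch.cat(outputs, dim=1)                          # (batch, N, H)
```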

C. TWO VARIANTS OF MULLA ATTENTION
The layer number of Mulla attention can be fixed or dynamic. The fixed version includes a fixed number of pooling layers, each with its own relative positional embeddings. The dynamic version automatically selects the most appropriate layer number by continuously pooling the input sequence until the shortest pooling sequence becomes shorter than the local radius, with shared positional embeddings during both the training and inference stages.
According to the hyperparameter search results in Appendix A, for the fixed version we choose a pooling rate of K = 4 and a layer number of L = 3 as the structural setting, and for the dynamic version we choose K = 8 with a dynamic layer number. In addition, as recommended by [10], we use a local radius of r = 127 for each local attention layer to maintain a balance between model accuracy and efficiency.
Figure 2 shows the relationship between the number of dynamically constructed pooling layers and the input sequence length when r = 127 and K = 8. In this figure, we can see that the layer number automatically increases by 1 when the input length increases by a factor of K, and such a logarithmically increasing layer number provides good scalability for processing longer sequences.
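Under one straightforward reading of this stopping rule (keep pooling while the top sequence is at least as long as the local radius, counting the input sequence as the first layer), the dynamic layer number can be computed as below; the helper name is ours, and the exact boundary convention of the paper may differ slightly.

```python
import math

def dynamic_layer_number(n, radius=127, pool_rate=8):
    # count the input layer plus pooling rounds until the top sequence drops below the radius
    layers, length = 1, n
    while length >= radius:
        length = math.ceil(length / pool_rate)
        layers += 1
    return layers

# example usage: layer counts for a few input lengths (r = 127, K = 8)
for n in (8 * 1024, 16 * 1024, 48 * 1024):
    print(n, dynamic_layer_number(n))
```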

D. COMPLEXITY
As a comparison, the complexity of Mulla attention is O(rNL) for the fixed version and O(rN log_K(N/r)) for the dynamic version. For the fixed version, we can observe that it is equivalent to computing L different local attentions at once, so the complexity is O(rNL). For the dynamic version, the layer number is determined by the number of rounds required to continuously pool the input sequence until the shortest pooling sequence becomes shorter than the local radius, so it has log_K(N/r) layers and the complexity is O(rN log_K(N/r)). From this comparison, we can see that our proposed Mulla attention avoids the quadratic complexity of full attention and transient global attention, which results in certain advantages for processing longer sequences. It is worth noting that the cross-attention and self-attention mechanisms in the Decoders continue to use full attention, taking into account the concept of fusion-in-decoder [24] and the relatively short target sequences of downstream tasks.

IV. EXPERIMENTS
In this section, we introduce how to implement Mulla attention and the LongT5-Mulla model, and evaluate the LongT5-Mulla model on three long text summarization datasets under the common settings of 8∼16k long text tasks.

A. IMPLEMENTATION
We implement Mulla attention and the LongT5-Mulla model with PyTorch and the Hugging Face Transformers library. For Mulla attention, we adapt the algorithms shown in Appendix B, and we initialize LongT5-Mulla from the pre-trained LongT5-tglobal checkpoints³ ⁴ instead of training from scratch. To fairly and efficiently evaluate our models, we consider models of two sizes: base (∼220M) and large (∼770M), and inherit the SentencePiece [27] tokenizer used by T5 v1.1 and LongT5, which has a vocabulary size of 32k.
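As a rough illustration of this setup, the snippet below loads a public LongT5-tglobal checkpoint and swaps the self-attention module in each encoder block; MullaAttention and its from_pretrained_attention constructor are hypothetical names standing in for the module described in Section III, and the exact attribute names inside each encoder block depend on the Transformers version.

```python
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

checkpoint = "google/long-t5-tglobal-base"      # footnote 3; the -large checkpoint is footnote 4
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LongT5ForConditionalGeneration.from_pretrained(checkpoint)

for block in model.encoder.block:
    attn_layer = block.layer[0]                 # self-attention sub-layer of the encoder block
    for name, module in list(attn_layer.named_children()):
        if "Attention" in name:                 # e.g. TransientGlobalSelfAttention (version-dependent)
            # hypothetical wrapper that reuses the pre-trained Q/K/V projections and only
            # adds new relative positional embeddings for the pooling layers
            setattr(attn_layer, name, MullaAttention.from_pretrained_attention(module))
```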

B. FINE-TUNING
We fine-tune our models with a constant learning rate scheduler, without warm-up or decay. For hyperparameters, we select a learning rate of 1e-3 and a global batch size of 128 with the Adafactor [28] optimizer, and set the dropout rate to 0.1, following the same configuration as LongT5. To speed up the fine-tuning process, we train and evaluate our models in bf16 on 4 Nvidia A100-40G GPUs, and train on no more than 800k samples or 10 epochs to avoid excessively long training times on some large datasets. Based on the average input lengths of the long text summarization datasets, we set maximum input lengths of 8192 or 16384 tokens to make training more efficient. When generating, we perform greedy decoding instead of beam search, following the choice of previous works [10], [29].
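For orientation, the reported settings roughly map onto the Hugging Face Seq2SeqTrainingArguments as sketched below; the output directory, the per-device micro batch size, and the accumulation factor used to reach the global batch size of 128 are our assumptions, not values stated in the paper.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="longt5-mulla-summarization",  # assumed output path
    learning_rate=1e-3,
    lr_scheduler_type="constant",             # constant schedule, no warm-up or decay
    optim="adafactor",                        # Adafactor optimizer
    per_device_train_batch_size=1,            # assumed micro batch size
    gradient_accumulation_steps=32,           # 4 GPUs x 1 x 32 = global batch size 128 (assumed split)
    num_train_epochs=10,
    bf16=True,                                # bf16 training and evaluation
    predict_with_generate=True,
    generation_num_beams=1,                   # greedy decoding
    generation_max_length=512,
)
```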

C. DATASETS
We fine-tune and evaluate our model on long text summarization tasks with the following three datasets. Table 2 shows the basic statistics of the datasets we use, including the sample counts of the train, evaluation, and test sets, as well as the average, median, and maximum input and output lengths.

1) MULTI-NEWS
It is a large-scale news summarization dataset with news articles and corresponding professional human-written summaries of these articles from the Internet [16].

³ https://huggingface.co/google/long-t5-tglobal-base
⁴ https://huggingface.co/google/long-t5-tglobal-large

TABLE 3. Summarization results on the Multi-News dataset test set of LongT5-Mulla models and baseline models. All models are large size, and are supervised fine-tuned and tested with a maximum input length of 8192 and a maximum output length of 512. The scores of baseline models without stars are from [30], [31], [32], [33], [29], [34], [10], from top to bottom respectively.

2) ARXIV
This dataset selects scientific publications from the free online science pre-print repository arXiv, and uses articles as source texts and abstracts as target texts [17].

3) WCEP-10
This is a dataset for multi-document summarization that collects data from the Wikipedia Current Events Portal (WCEP). In our paper, we use the specific version called WCEP-10, where only 10 documents are kept in each news cluster instead of the up to 100 documents in the original dataset, finally yielding thousands of pairs of news event documents as source texts and their human-written summaries as target texts [18].
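For reference, the three datasets can be obtained with the Hugging Face datasets library roughly as below; the hub identifiers for the arXiv and WCEP-10 versions are assumptions (community mirrors), not paths given by the paper.

```python
from datasets import load_dataset

multi_news = load_dataset("multi_news")               # news clusters -> human-written summaries
arxiv = load_dataset("ccdv/arxiv-summarization")      # assumed hub id: articles -> abstracts
wcep10 = load_dataset("ccdv/WCEP-10")                 # assumed hub id for the 10-document version

# quick look at the train split sizes
print({name: len(ds["train"]) for name, ds in
       [("multi_news", multi_news), ("arxiv", arxiv), ("wcep10", wcep10)]})
```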
Tables 3, 4, and 5 show the results of the long text summarization task on the Multi-News, arXiv, and WCEP-10 datasets, respectively. In these tables, the baseline models marked with stars are trained and evaluated by ourselves using the same hyperparameters as LongT5-Mulla, because their original papers do not report such results. Scores in bold represent the best results for each metric, while underlined scores represent the second-best results.
TABLE 4. Summarization results on the arXiv dataset test set of LongT5-Mulla models and baseline models. All models are base size, and are supervised fine-tuned and tested with a maximum input length of 16384 and a maximum output length of 512. The scores of baseline models without stars are from [7], [9], [9], [11], [35], [36], from top to bottom respectively.

TABLE 5. Summarization results on the WCEP-10 dataset test set of LongT5-Mulla models and baseline models. All models are large size, and are supervised fine-tuned and tested with a maximum input length of 8192 and a maximum output length of 512. The scores of BART-DYNE-5 are from [37], the scores of RL-RELAX are from [38], and the scores of LED and LED-UPER are from [39].
For the Multi-News dataset, our LongT5-Mulla(dynamic) model outperforms 6 out of 8 baseline models in all three types of Rouge score, and has a difference of −0.27, −1.61, +0.20, and +0.22 percentage points (pp) compared to the best Rouge-1, Rouge-2, Rouge-L, and averaged Rouge scores among the baseline models. Furthermore, LongT5-Mulla(dynamic) outperforms LongT5-Mulla(fixed) with an averaged Rouge score improvement of +0.51 pp, which can be attributed to the better scalability of dynamic Mulla attention.
For the arXiv dataset, our LongT5-Mulla(dynamic) model outperforms 7 out of 8 baseline models in all three types of Rouge score, and has a difference of +0.14, −0.11, +0.02, and +0.01 pp compared to the best Rouge-1, Rouge-2, Rouge-L, and averaged Rouge scores among the baseline models.
And for the WCEP-10 dataset, our LongT5-Mulla(dynamic) model outperforms all 6 baseline models in all three types of Rouge score, and has a difference of +0.91, +0.34, +0.32, and +0.52 pp compared to the best Rouge-1, Rouge-2, Rouge-L, and averaged Rouge scores among the baseline models.
From these quantitative results, we can conclude that our LongT5-Mulla(dynamic) model achieves state-of-the-art results, with improvements of at least +0.22, +0.01, and +0.52 pp in averaged Rouge score on the Multi-News, arXiv, and WCEP-10 datasets, respectively, in the 8∼16k-input long text summarization task, while still being competitive when each type of Rouge score is considered individually. This demonstrates the high accuracy of our model in processing long sequences and its ability to effectively capture long-range dependencies using Mulla attention.

TABLE 6. Memory consumption (max 40.1 GB) when fine-tuning LongT5-tglobal and LongT5-Mulla on 4 Nvidia A100-40G GPUs with different input lengths. We report the maximum memory consumption among all devices.

V. ANALYSIS
Benefiting from the lower complexity of Mulla attention and the consequently lower computing consumption, the LongT5-Mulla model can train and predict on longer sequences than models designed for common long sequences under the same hardware conditions. In this section, we further explore the properties of the LongT5-Mulla model with Mulla attention on common long sequences of 8∼16k tokens and longer sequences of 16∼48k tokens.

1) INPUT LENGTH VS MEMORY CONSUMPTION
To analyze the memory consumption of our model, we perform a series of fine-tuning test runs on numerous sufficiently long samples, using scaled-up model input lengths ranging from 8k to 48k, for two variants of LongT5: LongT5-Mulla and LongT5-tglobal. We set the maximum output length to 512 and use a micro batch size of 1. Throughout the process, we record the memory consumption printed in the logs.
As shown in Table 6, our LongT5-Mulla model has competitive memory consumption, with a difference of 0.0∼1.0 GB and a 0.0∼0.1 GB memory reduction for the base and large model sizes, respectively, when the input length is 8∼16k. Furthermore, its memory consumption increases steadily and effectively avoids out-of-memory errors when the input length is further increased to 32k or 48k. In contrast, the original LongT5 with transient global attention becomes overloaded when processing longer sequences. These results show that LongT5-Mulla extends the length limit of LongT5-tglobal by at least 1.5 times under the same hardware conditions.
To better understand the memory cost of the attention modules rather than the whole models, Figure 3 gives a theoretical view of the relationship between the input sequence length and the per-token memory cost of each attention module under their current implementations. In this figure, we can see that LongT5-Mulla behaves like LongT5-local, with a relatively low and steady memory consumption; LongT5-Mulla only increases its per-token consumption when it reaches the boundary condition for constructing a new pooling layer for a longer sequence. For LongT5-tglobal, the per-token memory cost increases linearly because global tokens are selected in proportion to the length, and it becomes very large in scenes with longer sequences of 16∼48k tokens. This explains why the memory cost of LongT5-Mulla is only half that of LongT5-tglobal when the input length increases to 32k, and we conclude that LongT5-Mulla effectively processes longer sequences of 16∼48k tokens with at least 52.6% lower memory consumption than LongT5-tglobal.
It is worth noting that LongT5-Mulla is not always more efficient than LongT5-tglobal when the sequence is short. As discussed in Appendix B, supposing d = r + 1, for each token in the input sequence, local attention needs to compute 3d tokens as keys and values, transient global attention needs 3d + N/K tokens, and Mulla attention needs 3dL tokens. When the input sequence is relatively short, i.e., N < 3dK(L − 1), LongT5-Mulla with Mulla attention may consume more memory than LongT5-tglobal with transient global attention.
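The quoted per-token counts and the crossover condition can be checked with a few lines of arithmetic; the snippet below plugs in the paper's settings for the fixed variant (r = 127, hence d = 128, K = 4, L = 3) and is only a sanity check of the formulas, not a memory measurement.

```python
# per-token key/value counts: local 3d, transient global 3d + N/K, Mulla 3dL
d, K, L = 128, 4, 3

def keys_per_token(n):
    return {"local": 3 * d, "tglobal": 3 * d + n // K, "mulla": 3 * d * L}

for n in (1024, 3072, 8192, 16384, 32768, 49152):
    print(n, keys_per_token(n))

# crossover below which Mulla attends to more tokens than transient global attention
print("3dK(L-1) =", 3 * d * K * (L - 1))   # 3072 tokens
```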

2) INPUT LENGTH VS SPEEDUP
To analyze another aspect of efficiency, we also perform a series of test runs measuring samples per second, in order to examine the inference speedup of LongT5-Mulla compared to LongT5-tglobal with input lengths ranging from 8k to 48k.

TABLE 7. Samples per second of LongT5-tglobal and LongT5-Mulla, and the relative speedup between them, with different maximum input lengths on 1 Nvidia A100-40G GPU, where zero means out-of-memory during inference.
As shown in Table 7, the LongT5-Mulla model has an inference speed similar to LongT5-tglobal, with a speedup ranging from −2.9% to +18.6% on 8k and 16k long sequences, respectively, but the speedup grows from +5.7% to infinity as the sequence length increases to 48k, due to the out-of-memory problem of LongT5-tglobal. These results demonstrate that LongT5-Mulla has better efficiency on 16∼48k longer sequences.

3) INPUT LENGTH VS PERFORMANCE
To evaluate and analyze the performance of the fine-tuned LongT5 models, including LongT5-local, LongT5-Mulla(fixed), and LongT5-Mulla(dynamic), during inference with different input lengths, we conduct experiments that generate summaries, with different maximum input lengths, on the 130 test samples with the longest inputs in the Multi-News dataset.
As shown in Table 8, we calculate Rouge scores for the generated results, where underlined scores represent the best results of each model, and bold scores represent the best results among those underlined. In this table, we find that both versions of LongT5-Mulla benefit from longer input lengths, with improvements of +1.18 and +1.97 pp in averaged Rouge score from 8k to 48k for LongT5-Mulla(fixed) and LongT5-Mulla(dynamic), respectively. In contrast, LongT5-local benefits from longer inputs initially, but its Rouge scores plateau at a length of 32k. We infer that this is because a longer input provides more evidence, but the precondition for the model to extract more information is the ability to capture long-range dependencies, which demonstrates the superiority of LongT5-Mulla over LongT5-local.
Furthermore, compared to the LongT5-local models, LongT5-Mulla(fixed) achieves competitive results with a difference of −0.37∼+0.73 pp in averaged Rouge score at lengths of 8∼48k, while LongT5-Mulla(dynamic) performs significantly better, with a difference of +0.56∼1.62 pp in averaged Rouge score. This improvement of LongT5-Mulla(dynamic) can be attributed to its larger attention range compared to LongT5-local and LongT5-Mulla(fixed).
These results show the great ability of our model, especially LongT5-Mulla(dynamic), to effectively process 16∼48k longer sequences with high accuracy using Mulla attention.

VI. CONCLUSION
In this paper, we propose Mulla attention, a hierarchical local attention that acts simultaneously on the input sequence and on multiple pooling sequences of different granularity, and we apply this mechanism to the pre-trained LongT5 model to construct our LongT5-Mulla model. Experiments show that our model achieves state-of-the-art results in the 8∼16k-input long text summarization task, with improvements of at least +0.22, +0.01, and +0.52 pp in averaged Rouge score on the Multi-News, arXiv, and WCEP-10 datasets, respectively. Further studies show that our model can effectively process longer sequences of 16∼48k tokens with at least 52.6% lower memory consumption than LongT5-tglobal and +0.56∼1.62 pp higher averaged Rouge scores than LongT5-local. These experiments and studies demonstrate that our proposed LongT5-Mulla model can effectively process 8∼16k long sequences and extend the maximum input length for long text tasks from 16k to 48k while maintaining accuracy and efficiency, providing a feasible solution for NLP tasks involving longer sequences.
As a limitation, we have not attempted to incorporate Mulla attention into a larger model with billions of parameters due to our limited resources, and we have not explored the potential of Mulla attention in a Decoder-only Transformer architecture (models such as BART, LED, and LongT5 are Encoder-Decoder architectures, while models such as GPT2 are Decoder-only architectures). For future work, we would like to conduct additional research to further explore these directions.

APPENDIX A ABLATION STUDY ON HYPERPARAMETERS OF MULLA ATTENTION
We conduct an ablation study on the hyperparameters of Mulla attention to determine the most appropriate layer number and pooling rate.
In this study, we fine-tune LongT5-Mulla on the Multi-News dataset for 10 epochs with the hyperparameters reported in Section IV, and observe how the Rouge scores change. We experiment with different structural settings, varying the layer number from 2 to 4 and the pooling rate from 4 to 8, in order to find the most suitable one.
Table 9 shows the 8k-input long text summarization results on the Multi-News dataset for the LongT5-Mulla model with different layer numbers and pooling rates, where bold scores represent the best results for each metric and underlined scores represent the second-best results. As shown in the table, LongT5-Mulla(fixed) with a layer number of 2 and a pooling rate of 8, as well as LongT5-Mulla(fixed) with a layer number of 3 and a pooling rate of 4, are the two best models among the fixed versions of LongT5-Mulla, with improvements of at least +0.17, +0.21, +0.11, and +0.19 pp in Rouge-1, Rouge-2, Rouge-L, and averaged Rouge score compared to the other settings, respectively. We can also see that LongT5-Mulla(dynamic) performs better than LongT5-Mulla(fixed), with improvements of at least +0.83, +0.37, +0.19, and +0.48 pp in Rouge-1, Rouge-2, Rouge-L, and averaged Rouge score.
From these results, we can infer that for the fixed version of Mulla attention, a smaller pooling rate usually works well with a larger layer number, while a larger pooling rate is usually suitable for a smaller layer number. Besides, increasing the layer number does not seem to provide additional benefits, which may be due to the increased difficulty of training the model with more newly introduced positional embeddings in the fixed version of Mulla attention.
Finally, we choose a layer number of 3 and a pooling rate of 4 as the structural setting for the fixed version of Mulla attention, due to its richer hierarchical structure and competitive results. For the dynamic version, we select a pooling rate of 8 to prevent the module from constructing too many pooling layers in long text scenarios.

APPENDIX B IMPLEMENTATION OF MULLA ATTENTION
Compared to full attention, sparse attention needs to construct different key and value sequences for each query token if no optimization is performed, which leads to huge O(N²H) key and value matrices with unacceptable memory costs instead of the original O(NH), where the input length is N and the hidden size is H. For this reason, how to implement sparse attention efficiently is as important as how to design it.
In Figure 4, we show the implementations of three sparse attention methods including our Mulla attention.
For local attention, ETC [8] proposes a sliding-window algorithm that first groups tokens and then performs attention by group. Suppose the local radius is r. It first sequentially groups the input sequence into grouped sequences with a group size of d = r + 1. Then, for each grouped sequence, it concatenates the d nearest left neighbor tokens and the d nearest right neighbor tokens with the grouped sequence into an augmented grouped sequence of length 3d. After that, it performs full attention within each group, where the grouped sequence is projected into the query matrix and the augmented grouped sequence is projected into the key and value matrices, and it masks specific elements in each row of the attention score matrix to restrict the activated local area to a length of 2r + 1 before computing the SoftMax. Finally, it collects the output tokens from the attention of each group and reshapes them into the output sequence. In this process, there are N/d groups, and each group has key and value matrices of shape 3d × H, so it consumes 3NH memory space in total, which is O(NH).
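A compact sketch of this blocked procedure is shown below; it assumes a single head and an input length that is already a multiple of the group size d = r + 1, and it omits the extra masking that restricts each row to the 2r + 1 window.

```python
import torch
import torch.nn.functional as F

def blocked_local_attention(x, w_q, w_k, w_v, radius):
    b, n, h = x.shape
    d = radius + 1                                        # group size
    g = n // d                                            # number of groups
    groups = x.view(b, g, d, h)                           # (b, g, d, h)
    pad = torch.zeros(b, 1, d, h, dtype=x.dtype, device=x.device)
    left = torch.cat([pad, groups[:, :-1]], dim=1)        # previous group (or padding)
    right = torch.cat([groups[:, 1:], pad], dim=1)        # next group (or padding)
    augmented = torch.cat([left, groups, right], dim=2)   # (b, g, 3d, h)

    q = groups @ w_q                                      # (b, g, d, h)
    k = augmented @ w_k                                   # (b, g, 3d, h)
    v = augmented @ w_v
    scores = q @ k.transpose(-2, -1) / (h ** 0.5)         # (b, g, d, 3d)
    # a full implementation would also mask positions outside the 2r + 1 window here
    out = F.softmax(scores, dim=-1) @ v                   # (b, g, d, h)
    return out.reshape(b, n, h)                           # key/value memory: 3NH, i.e. O(NH)
```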
For transient global attention, LongT5 [10] inherits the implementation from ETC. The main difference is that in each group, it constructs a pooling sequence consisting of global tokens and concatenates the pooling sequence to the augmented grouped sequence before performing attention. Note that each group has key and value matrices of shape (3d + N/K) × H, where K is the pooling rate, so the memory consumption is O(N²H). In spite of this, it is usually affordable in 8∼16k long text scenes.
For Mulla attention, we combine the techniques of local attention and transient global attention and design an implementation that first pools the input sequence into multiple pooling sequences and then groups them into N/d groups, where each group has one grouped sequence from the input sequence and multiple augmented grouped sequences from the input sequence and the pooling sequences. Because the pooling sequences are shorter than the input sequence, grouping them with the same group length d produces fewer augmented grouped sequences than for the input sequence, and those augmented grouped sequences cannot be directly allocated to the N/d groups. Therefore, the augmented grouped sequences belonging to the different pooling sequences x^{l=2}, x^{l=3}, …, x^{l=L} are replicated K, K², …, K^{L−1} times, respectively, to ensure that each pooling sequence has the same number N/d of augmented grouped sequences as the input sequence. After that, when performing attention by group, it first concatenates all the augmented grouped sequences of each group, and then proceeds similarly to local attention to generate the output sequence. Because there are log_K N augmented grouped sequences in each group, the memory consumption is O(NH log_K N), which is better than that of transient global attention, especially in scenes with longer sequences.
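The replication trick can be sketched as follows, reusing the blocked grouping idea from the local-attention sketch above; lengths are assumed to be multiples of the group size for readability, and repeat_interleave stands in for the replication described in the text.

```python
import torch

def augmented_groups(seq, d):
    # split `seq` (b, m, h) into groups of size d and attach left/right neighbour groups
    b, m, h = seq.shape
    g = m // d
    groups = seq.view(b, g, d, h)
    pad = torch.zeros(b, 1, d, h, dtype=seq.dtype, device=seq.device)
    left = torch.cat([pad, groups[:, :-1]], dim=1)
    right = torch.cat([groups[:, 1:], pad], dim=1)
    return torch.cat([left, groups, right], dim=2)         # (b, g, 3d, h)

def mulla_key_value_groups(layers, d, pool_rate):
    # layers[0] is the input sequence, layers[1:] are its pooling sequences
    num_groups = layers[0].size(1) // d
    per_layer = []
    for j, seq in enumerate(layers):
        aug = augmented_groups(seq, d)                      # (b, g_j, 3d, h)
        if j > 0:
            # pooled layer j produces ~N/(d*K^j) groups; replicate each K^j times so that
            # every one of the N/d input groups also sees the matching pooled-layer window
            aug = aug.repeat_interleave(pool_rate ** j, dim=1)[:, :num_groups]
        per_layer.append(aug)
    return torch.cat(per_layer, dim=2)                      # (b, N/d, 3d * L, h) keys/values
```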

FIGURE 2. The relationship between the layer number of dynamic Mulla attention and the input length, where the local radius is 127 and the pooling rate is 8.

E. FROM MULLA ATTENTION TO LONGT5-MULLA
To construct the LongT5-Mulla model, we incorporate Mulla attention into LongT5 by simply replacing the transient global attention in the Encoders with Mulla attention, while keeping the basic Transformer Encoder-Decoder architecture unchanged, which demonstrates the strong compatibility of Mulla attention. From the two versions of Mulla attention, we derive two variants of LongT5-Mulla: LongT5-Mulla(fixed) and LongT5-Mulla(dynamic).

FIGURE 3. Relative memory consumption of each attention module based on its implementation. We assume the per-token memory consumption of LongT5-local at a length of 1024 is 1. Here K means the pooling rate.

TABLE 9. Summarization results on the Multi-News dataset test set of LongT5-Mulla with Mulla attention under different layer number and pooling rate settings. All models are large size and are trained and tested with a maximum input length of 8192 and a maximum output length of 512. R-1, R-2, and R-L represent Rouge-1, Rouge-2, and Rouge-L, respectively. Dyn means the dynamic version of LongT5-Mulla.

FIGURE 4. Implementations of local attention (top), transient global attention (middle), and Mulla attention (bottom). Suppose the input length is 16, the local radius is 3, the pooling rate is 2 for transient global attention and Mulla attention, and the layer number is 3 for Mulla attention. x_0 is the padding token.

Table 1 shows the parameter counts and complexity of Mulla attention and the different attention methods from [1], [7], [10]. As shown in the table, all the methods have O(H²) parameters because they use a fixed number of linear projections to construct the query, key, and value tokens, where each projection is an H × H matrix, resulting in O(H²). For complexity, full attention has a complexity of O(N²) because each token in the input sequence attends to all other tokens. Local attention has a complexity of O(rN) because each token in the input sequence only attends to r neighbors. And the combination of local attention and global attention from LongT5, called transient global attention, has a complexity of O(rN + N²/K), because it introduces N/K global tokens as shared neighbors for each token.

TABLE 1. Parameter counts and complexity of different attention methods, where r is the local radius, N is the input length, K is the pooling rate, and L is the layer number of the fixed version of Mulla attention.

TABLE 2. Basic statistics of the long text summarization datasets. Input and output lengths are counted in tokens.

TABLE 8. Rouge scores of generated summaries on the 130 samples with the longest input lengths in the Multi-News test set, with different generation-time input lengths, based on fine-tuned LongT5-local and LongT5-Mulla models of large size trained with an input length of 8192.