Off-Chip Memory Allocation for Neural Processing Units

Many modern Systems-on-Chip (SoCs) are equipped with specialized Machine Learning (ML) accelerators that use both on-chip and off-chip memory to execute neural networks. While on-chip memory usually has a hard limit, off-chip memory is often considered large enough to hold the network’s inputs, outputs, weights, and any intermediate results that may occur during model execution. This assumption may not hold for edge devices, such as smartphones, which usually have a limit on the amount of memory a process can use. In this study, we propose a novel approach for minimizing a neural network’s off-chip memory usage by introducing a tile-aware allocator capable of reusing memory occupied by parts of a tensor before the entire tensor expires. We describe the necessary conditions for such an off-chip memory allocation approach and provide the results, showing that it can save up to 33% of the peak off-chip memory usage in some common network architectures.


I. INTRODUCTION
In recent years, significant advances have been made in Deep Learning (DL) in several areas. DL models have achieved great accuracy in many computer vision tasks, including image classification, semantic segmentation, super-resolution, object recognition, and others [1], [2], [3], [4], [5], as well as in other domains, such as Natural Language Processing (NLP), Speech Recognition [6], [7], [8] and natural language generation (NLG) [9], [10], [11]. The improved accuracy of these models comes at the cost of an increased number of parameters and larger feature maps. Therefore, reducing the amount of memory used to execute a model has become increasingly important.
The commercialization of these DL models prompts many companies to develop specialized AI hardware, whose main purpose is to reduce inference latency or decrease energy consumption. These accelerators are often referred to as Neural Processing Units (NPUs). They can be installed on edge devices (such as mobile devices, embedded solutions, wearables, or IoT devices with microcontrollers) to locally execute the DL model. This edge-computing solution preserves data privacy and provides real-time processing [12]. The benefits of a dedicated AI accelerator include reduction of load on the CPU, GPU, and main memory, ensuring stable performance, and speeding up inference.
The associate editor coordinating the review of this manuscript and approving it for publication was Mario Donato Marino.
Effective memory management is crucial for edge devices because of their resource-constrained environments. These devices often have weaker CPUs and more flash storage than RAM, which makes it difficult for them to process neural networks efficiently. Addressing the task of reducing memory consumption can enhance the device's smart functionality by enabling the execution of more complex neural networks or NN ensembles on the device. Moreover, it can decrease the size and cost of the target product by reducing the memory requirements. Even high-end mobile devices have system-level mechanisms to manage memory usage, which can lead to the termination of apps that exceed memory limits, thereby limiting the complexity and performance of deployable NN models.
Reduction of the main memory load is possible because most accelerators are equipped with fast on-chip memory, which is exclusively used for computations, while the main memory only stores inputs and outputs of neural network layers. Unfortunately, in many cases, on-chip memory is not large enough to fit an entire feature map, meaning that only a part of the feature map can be loaded and processed at a time, while the bulk of the data remains in the external or off-chip memory.
To achieve this, feature maps must be divided into smaller pieces, often referred to as slices, views, or tiles. Tiles are loaded from external into the on-chip memory, processed by the NPU, and the result is stored back into the external memory to free the on-chip memory for the next tile. To avoid overwriting the input feature map data in external memory, the existing DL compilers allocate a separate external memory region for the output feature map and retain the entire input in memory until it is fully processed by the NPU. This design decision is made in part because the tile of a tensor, in general, cannot be presumed to occupy a contiguous memory region because it is a slice of a multidimensional tensor stored in a linear memory space. Consider a tensor with 128 × 128 elements allocated in the off-chip memory in row-major order. Fig. 1 shows the memory regions occupied by tiles of sizes 128 × 32, 128 × 64 and 64 × 128. Because of this non-contiguous nature of tensor tiles, freeing memory occupied by a tile before the entire feature map is processed would result in a highly fragmented memory space, and it is presumed that it would yield little opportunity for memory consumption optimization.
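The fragmentation described above can be made concrete with a small sketch. The helper `contiguous_runs` below is hypothetical (it is not part of any compiler mentioned here) and assumes that tile sizes are given as (rows, columns) and that every tile starts at the tensor origin; it lists the contiguous memory runs that a 2-D tile occupies in a row-major tensor.

```python
def contiguous_runs(tensor_shape, tile_shape, tile_origin=(0, 0)):
    """Return (offset, length) runs, in elements, of a 2-D tile in a row-major tensor."""
    _, cols = tensor_shape
    t_rows, t_cols = tile_shape
    r0, c0 = tile_origin
    runs = []
    for r in range(r0, r0 + t_rows):
        start = r * cols + c0
        # Merge with the previous run when this row starts exactly where it ended.
        if runs and runs[-1][0] + runs[-1][1] == start:
            runs[-1] = (runs[-1][0], runs[-1][1] + t_cols)
        else:
            runs.append((start, t_cols))
    return runs


for tile in [(128, 32), (128, 64), (64, 128)]:
    runs = contiguous_runs((128, 128), tile)
    print(tile, "->", len(runs), "run(s) of", runs[0][1], "elements")
```

Under these assumptions, the first two tile shapes split into one short run per row, whereas the full-width 64 × 128 tile collapses into a single contiguous run.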
The tasks of finding the optimal tiling strategy and allocating memory for tensors are performed by Deep Learning compilers such as Glow [13], TVM [14], nGraph [15], DBLP [16] or DLVM [17]. This paper proposes a compiler optimization for NPUs with the main goal of reducing the required amount of external memory occupied by large intermediate feature maps. This is achieved by reusing the memory occupied by a tile of the input tensor as soon as the tile is processed and there are no more usages of its data. In many cases, memory space is fragmented after freeing a single tile. Our method benefits from the fact that this fragmentation can exhibit patterns in which different tiles can often occupy free non-contiguous memory regions without any overlap with the remaining tiles of the original tensor.
We propose a method of allocating tensors in external memory that considers both the difference in the lifetimes of a tensor and its tiles and the non-contiguity of the memory occupied by tiles. We demonstrate that our memory allocation algorithm can be effectively applied to reduce the amount of external memory used by a neural network model.
In Section II, we describe the problem in detail and present several previously proposed solutions. We present our Tile-Aware Allocator (TAA) design in Section III, and in Section IV we show that this approach can significantly reduce peak off-chip memory usage in many popular NN models, and compare external memory demand with the Shared Objects algorithm described in [18].

II. PROBLEM STATEMENT AND RELATED WORK
Allocating tensor data is a common task for many deep neural network (DNN) compilers. There are two main approaches to memory allocation: dynamic and static. Dynamic allocation involves a runtime environment that allocates memory during program execution. In contrast, static memory allocation reserves memory regions for tensors and data structures at compile time, before the program runs. Deep Learning compilers typically implement a memory manager that conducts a memory planning pass for static allocation to a pre-allocated memory buffer holding intermediate tensors and applies optimizations. This paper discusses an algorithm for memory management in a Deep Learning compiler that facilitates static memory allocation.
The task of allocating tensor data resembles a strip-packing problem. This problem involves a memory region with a specified width and infinite height along with a set of items (tensors) characterized by their size and lifetime. In this context, the width of the memory region corresponds to time, whereas its height represents a memory offset. The lifetime of a tensor is typically determined by the interval between its first and last use in a graph walk. The problem is to find the lower-left corner of each item such that no overlap between items occurs and the height of the packing is minimal.
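To illustrate the strip-packing view, the following minimal sketch checks whether a given assignment of offsets is a valid packing and computes its height. The `Item` type, its field names, and the sample values are assumptions made for this example only; lifetimes are treated as closed intervals of time steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    first_use: int  # first time step at which the item is live (inclusive)
    last_use: int   # last time step at which the item is live (inclusive)
    offset: int     # start address in the buffer, in bytes
    size: int       # size in bytes


def lifetimes_intersect(a: Item, b: Item) -> bool:
    return a.first_use <= b.last_use and b.first_use <= a.last_use


def memory_intersects(a: Item, b: Item) -> bool:
    return a.offset < b.offset + b.size and b.offset < a.offset + a.size


def is_valid_packing(items: list) -> bool:
    """True if no two items overlap in both time and memory."""
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if lifetimes_intersect(a, b) and memory_intersects(a, b):
                return False
    return True


def peak_memory(items: list) -> int:
    """Height of the packing: the highest address reached by any item."""
    return max((it.offset + it.size for it in items), default=0)


packing = [
    Item("A", 0, 1, offset=0, size=100),
    Item("B", 1, 2, offset=100, size=50),
    Item("C", 2, 3, offset=0, size=80),  # reuses A's region after A expires
]
print(is_valid_packing(packing), peak_memory(packing))
```

An allocator searches over the `offset` values while keeping lifetimes fixed; a scheduler would instead change `first_use` and `last_use`.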

A. PRIOR WORK
The strip-packing problem is known to be NP-hard; therefore, researchers have attempted to find heuristics to approximate the optimal solution [19], [20], [21], [22].
In general, DNN compilers split the strip-packing problem between memory allocators and schedulers. Memory allocators focus on finding item offsets that produce optimal packing, and keep the tensor lifetime fixed. Schedulers vary only the tensor lifetime. Both can be modified to produce a better packing. Pisarchyk and Lee [18] proposed a memory-optimizing allocator based on a linear scan algorithm, Barenboim et al. [23] used graph coloring, and Ahn et al. [16] implemented a memory-aware scheduler.
Other methods for memory usage reduction exist. General-purpose deep learning frameworks such as Caffe [24], PyTorch [25], and MXNet [26] incorporate several static memory reduction techniques. Common optimization methods, such as operation fusion, in-place operations, and memory sharing, can be adapted for deep learning compilers that target edge devices. For instance, operation fusion combines activation functions with the preceding operation (e.g., convolution), optimizing both the performance and memory transfer between the host device and the accelerator. In-place operations store output values directly in the memory assigned to an input value. Memory sharing optimization repurposes the memory of intermediate results that are no longer required for further computation. Similar optimizations are discussed in [27], whereas [28] delves into memory optimization techniques aimed at reducing memory consumption and enhancing computational effectiveness by minimizing additional memory transformations for memory-bound operations. Operations such as depthwise convolution can introduce additional performance overhead.
9932 VOLUME 12, 2024. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In [29], the authors optimized the on-chip memory using a scheduling method with operation fusion and memory reuse. Additionally, frameworks such as the Glow Compiler [13] implement buffer sharing optimizations that attempt to reuse an instruction input's buffer for its output. Minakova and Stefanov [30] and Artemev and Roeder [31] employed the Cyclo-Static Dataflow model and MapReduce techniques, leveraging CNN properties to process model input in parts, thus reducing the size of intermediate tensors. General-purpose deep learning frameworks, such as those in [25] and [32], are designed for both training and inference of neural networks. In this study, we focus on the inference-only work mode. Consequently, certain memory optimizations are inapplicable, such as in-place operations [27], which are utilized during the training mode and require data from the forward pass to be retained for the backward pass.

B. CHALLENGES
Freeing memory occupied by a tile introduces memory fragmentation because tiles occupy non-contiguous regions of memory. To place a different tile into this non-contiguous memory space, we need to be sure that the new tile does not intersect any other live memory region between the free memory fragments.
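Let us consider a tile $t$ that spans $s_i(t)$ elements in the $i$-th dimension of tensor $T$, starting with element $o_i(t)$, $i \in [0, \mathrm{rank}(T))$. The memory distance between two consecutive elements of $t$ in dimension $i$ is denoted as $\mathrm{stride}_i(t)$. The memory address of a tile element at index $x$ is given by

$$\mathrm{addr}(x, t, T) = \mathrm{base}(T) + \sum_{i=0}^{\mathrm{rank}(T)-1} \bigl(o_i(t) + x_i\bigr)\,\mathrm{stride}_i(t), \tag{1}$$

where $\mathrm{base}(T)$ denotes the starting address of $T$. Two tiles $t$ and $\hat{t}$ of tensors $T$ and $\hat{T}$ overlap if they share the same memory address for some elements $x$ and $\hat{x}$, respectively:

$$\mathrm{addr}(x, t, T) = \mathrm{addr}(\hat{x}, \hat{t}, \hat{T}).$$

Alternatively, they overlap if the following equation has a solution for some $x$ and $\hat{x}$:

$$\sum_{i=0}^{\mathrm{rank}(T)-1} x_i\,\mathrm{stride}_i(t) - \sum_{j=0}^{\mathrm{rank}(\hat{T})-1} \hat{x}_j\,\mathrm{stride}_j(\hat{t}) = \mathrm{base}(\hat{T}) - \mathrm{base}(T) + \sum_{j=0}^{\mathrm{rank}(\hat{T})-1} o_j(\hat{t})\,\mathrm{stride}_j(\hat{t}) - \sum_{i=0}^{\mathrm{rank}(T)-1} o_i(t)\,\mathrm{stride}_i(t), \tag{2}$$

where $0 \le x_i < s_i(t)$ for $i \in [0, \mathrm{rank}(T))$ and $0 \le \hat{x}_j < s_j(\hat{t})$ for $j \in [0, \mathrm{rank}(\hat{T}))$.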
This linear Diophantine equation with bounded variables can be solved in polynomial time [33], [34]. In a naive approach, this equation needs to be solved n · m times to determine whether a single tile of a tensor conflicts with any other tile, where n is the number of tensors in a model, and m is the number of tiles in a tensor. Considering that there are m tiles in a tensor, to determine if a tensor can be placed at a given offset, this equation must be solved n · m² times. To find a suitable offset, additional S attempts are required, where S is proportional to the size of the on-chip memory in bytes. Overall, the complexity of this approach is O(S · n² · m²) for allocating all the tensors. This naive approach is prohibitively expensive in terms of computational cost. Therefore, this study focuses on two main questions.
1) How to efficiently check if two tiles of a tensor intersect?
2) How to adjust the desired memory location for the current tensor when a conflict exists?
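The overlap condition discussed in this subsection can be checked by brute force on small examples, which helps build intuition before any efficient method is considered. The sketch below is illustrative only: it assumes that the address of a tile element is the tensor start address plus the stride-weighted sum of the element coordinates, and the names `tile_addresses`, `tiles_overlap`, and the 8 × 8 example tensors are hypothetical. Enumerating every address is exactly the kind of cost the allocator must avoid for real tensors.

```python
from itertools import product


def tile_addresses(base, origin, shape, strides):
    """All addresses base + sum((o_i + x_i) * stride_i) for 0 <= x_i < s_i."""
    return {
        base + sum((o + xi) * st for o, xi, st in zip(origin, x, strides))
        for x in product(*(range(s) for s in shape))
    }


def tiles_overlap(tile_a, tile_b):
    """Brute-force test: do the two tiles share at least one address?"""
    return not tile_addresses(*tile_a).isdisjoint(tile_addresses(*tile_b))


# Two 8 x 8 row-major tensors T (base 0) and U (base 4), split into 8 x 4 halves.
# Each tile is (base, origin, shape, strides), with strides in elements.
left_T = (0, (0, 0), (8, 4), (8, 1))
right_T = (0, (0, 4), (8, 4), (8, 1))
left_U = (4, (0, 0), (8, 4), (8, 1))

print(tiles_overlap(left_T, right_T))  # halves of the same tensor
print(tiles_overlap(left_T, left_U))   # U's left half falls into the gaps of T's left half
print(tiles_overlap(right_T, left_U))  # U's left half lands exactly on T's right half
```

In this toy case, the left half of U can be placed while the left half of T is still live, provided the right half of T has already expired.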

C. DEFINITIONS
• First, a tile is defined as a fragment of a tensor. Each tensor is associated with a set of tiles.
• We define tile lifetime as the time interval between the first and last use of the tile.
• The tensor lifetime is the interval between the first and last uses of a tensor as a whole. For large tensors that cannot fit entirely into the limited on-chip memory, the tensor lifetime spans the time from when it is constructed from the output tiles of the previous layer to when it is first divided into input tiles for the next layer. Often, this means that the tensor lifetime is effectively reduced to zero: no operation can use the entire tensor without dividing it into smaller parts.
• We define a "peer" as an item (a tile or tensor) whose lifetime intersects with the lifetime of a given item.
• Next, we place constraints on the memory addresses where a tensor can be allocated. The tile offsets relative to the start of the tensor are fixed. The memory address of the tensor is considered invalid if any of the tiles overlap with another allocated item or its tile.

III. MEMORY ALLOCATION ALGORITHM
In this study, we propose a tile-aware memory allocator (TAA) that utilizes the difference between the lifetimes of the entire tensor and its individual tiles to reduce the off-chip memory footprint. Our approach efficiently handles a large number of tiles resulting from processing tensors on a chip with limited on-board memory. The main idea is to reuse the memory occupied by a tile immediately after its last use, but before the last use of the entire tensor.
The main contributions of this paper are as follows.
• We define the conditions under which two tensors with intersecting lifetimes can share memory and reduce the off-chip memory consumption.
• These conditions are used in the proposed memory allocation algorithm, which solves a modified version of the strip-packing problem. The algorithm considers both the tensor and tile lifetimes.
• Testing the implemented algorithm with widely used neural network models shows that our memory allocation approach reduces external memory usage by up to 33%. The results are presented in Section IV.
An example of tensor allocation, in terms of time and memory, is shown in Fig. 2. Let us assume that we have an intermediate tensor, as shown in Fig. 2a. The tensor is constructed from the six output tiles of the previous neural network layer (indicated by the pink-colored zones) and is divided into six tiles for the next layer (shown in light blue). Each tile consists of four chunks. The tensor as a whole, whose peak memory consumption is marked by a dark blue zone, has a shorter lifetime than its individual tiles.
Owing to the variations in tile lifetimes and their processing order, we can attempt to fit the output of an arbitrary neural network layer into the unoccupied space within the same time interval. An example of such a space utilization by an arbitrary tensor is shown in Fig. 2b. The initial tensor is indicated by a dark blue color, whereas the output tensor is marked in dark green.
The main task of the memory allocator is to find a suitable location in the memory for the next tensor, given the currently allocated items, such that none of its tiles intersect in both time and memory with another item. By iteratively applying this process to each tensor, a memory allocation scheme for the entire model is found. While the starting memory address for a tensor can be selected based on these constraints alone, the starting address for a tile is fixed at a specific offset relative to the beginning of the tensor.
Our strategy is to first find a suitable place for tensor T itself and then check that none of its tiles conflicts with the already allocated items. If any of them does, we adjust the starting address of T according to the size of the overlapping region of the two intersecting chunks.
In case of a conflict, such as when the tensor being allocated intersects in time and space with another tensor (e.g., indicated by the red zone in Fig. 2c) that has been allocated in the available memory space between the live tiles, we shift the tensor address by the size of the overlap region. This process will eventually yield the configuration presented in Fig. 2c, where the green tensor is shifted relative to the zero-memory offset.
The conditions for reusing occupied tile memory in the allocation algorithm are as follows.
• The tensors are divided into tiles and processed in parts. The order of tile processing is fixed.
• The lifetimes of tensors can intersect.
• The lifetimes of individual tiles should not intersect.

A. TENSORS ALLOCATION
The first part of the allocation algorithm starts with tensor sorting. All tensors are sorted according to one of the following three heuristics.

• Allocate items that take up the most memory first.
• Allocate items with longest lifetime first.
• Sort items by decreasing number of peers.
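The three orderings listed above can be sketched as sort keys. This is an illustrative simplification rather than the implementation evaluated in this paper: the `TensorInfo` type and the sample values are hypothetical, and peers are counted here between whole-tensor lifetimes, whereas the allocator works with tile lifetimes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorInfo:
    name: str
    size: int       # bytes
    first_use: int  # inclusive
    last_use: int   # inclusive


def count_peers(t, tensors):
    """Number of other tensors whose lifetime intersects that of t."""
    return sum(
        1 for o in tensors
        if o is not t and t.first_use <= o.last_use and o.first_use <= t.last_use
    )


def sort_tensors(tensors, heuristic):
    keys = {
        "size": lambda t: t.size,
        "lifetime": lambda t: t.last_use - t.first_use,
        "peers": lambda t: count_peers(t, tensors),
    }
    # Largest key first; Python's sort keeps ties in their original order.
    return sorted(tensors, key=keys[heuristic], reverse=True)


tensors = [
    TensorInfo("a", 300, 0, 1),
    TensorInfo("b", 100, 0, 5),
    TensorInfo("c", 200, 2, 3),
]
for h in ("size", "lifetime", "peers"):
    print(h, [t.name for t in sort_tensors(tensors, h)])
```

Different orderings can lead to different packings, so trying each one and keeping the best result is a natural way to use them.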
For each tensor T in the sorted list, we use a linear scan among the allocated tensor peers to find the first unoccupied region of memory of sufficient size to fit T.
Tensor peers are identified as tiles of other allocated tensors whose lifetimes overlap with those of any tile of the given tensor.
The task of finding peers is essentially the task of, given an interval I, finding intervals overlapping with I. The solution to this problem is trivial using an augmented Binary Search Tree, with the minimal start point and maximal end point of a subtree recorded at every tree node. The insertion and query complexity of this tree is O(log(n)), and the space complexity is O(n).
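A minimal sketch of such an augmented search tree is given below, assuming closed integer lifetimes. The class and method names are hypothetical, and the tree is left unbalanced for brevity, so the logarithmic bounds quoted above would additionally require a self-balancing scheme (for example, a red-black tree).

```python
class _Node:
    def __init__(self, start, end, payload):
        self.start, self.end, self.payload = start, end, payload
        self.left = self.right = None
        self.min_start = start  # smallest start point in this subtree
        self.max_end = end      # largest end point in this subtree


class IntervalTree:
    def __init__(self):
        self.root = None

    def insert(self, start, end, payload):
        self.root = self._insert(self.root, start, end, payload)

    def _insert(self, node, start, end, payload):
        if node is None:
            return _Node(start, end, payload)
        if start < node.start:
            node.left = self._insert(node.left, start, end, payload)
        else:
            node.right = self._insert(node.right, start, end, payload)
        node.min_start = min(node.min_start, start)
        node.max_end = max(node.max_end, end)
        return node

    def overlapping(self, start, end):
        """Payloads of all stored intervals that intersect [start, end]."""
        found = []
        self._query(self.root, start, end, found)
        return found

    def _query(self, node, start, end, found):
        # Prune subtrees whose combined span cannot touch [start, end].
        if node is None or node.max_end < start or node.min_start > end:
            return
        self._query(node.left, start, end, found)
        if node.start <= end and start <= node.end:
            found.append(node.payload)
        self._query(node.right, start, end, found)


tree = IntervalTree()
for name, (first, last) in {"i1": (0, 1), "i2": (0, 2), "i3": (0, 3), "i4": (0, 4)}.items():
    tree.insert(first, last, name)
print(tree.overlapping(2, 5))  # peers of an item that is live during [2, 5]
```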
Algorithm 1 solves the problem of determining a suitable starting address for a tensor. We employ a linear scan starting at a given memory offset to find a candidate location for a tensor and then check if any of its tiles intersect with any allocated peer items using Algorithm 2.

B. FINDING TILE INTERSECTIONS
Once a suitable memory region has been found, we check that no tile of T intersects with any other allocated peer tile. We note that while existing SMT solvers, such as Z3 [35], can solve problems such as Equation 2, our task is less general than the problems these solvers were designed for.
• The first simplification is that, by design, the number of contiguous regions in a tile is small. Selecting a tiling strategy that minimizes tile fragmentation is not only beneficial for the TAA algorithm but also reduces memory transfer delays owing to fewer transfer requests, as shown by Sousa et al. [36].
• The second simplification is that the boundary conditions on the variables in Equation 2 are determined by tensor sizes, which always have a lower limit of zero and a relatively small upper limit determined by the size of a model that can fit into the device memory.
Given these limitations, using a general-purpose SMT solver is not the most efficient solution because of the large number of tiles that can occur in a neural network and the prohibitively large number of equations to solve for a full allocation pass. Therefore, we employed a different approach.
In the preparation step, for each tile, we find all contiguous chunks of memory using Algorithm 3. This algorithm returns the offsets and sizes of each contiguous memory chunk of a tile. The complexity of this algorithm is O(n), where n denotes the number of chunks.
Given the offsets and sizes of each chunk of every tile, Algorithm 2 checks whether two tiles of a tensor intersect and returns the size of the first overlapping region found. It relies on the fact that Algorithm 3 returns the chunks in the order of increasing offsets and uses a linear scan across the chunks of the two tiles to find the first intersection. Note that calls to the RenderChunks algorithm simply retrieve a value memoized in the preparation step. The complexity of Algorithm 2 is linear in terms of the number of chunks.

C. STEP-BY-STEP EXAMPLES

1) ALGORITHM RENDERCHUNKS
In this section, we provide a step-by-step example of the RenderChunks algorithm for rendering contiguous chunks of a 4 × 64 × 128 tile of a 4 × 128 × 128 tensor in CHW format. In this example, the tile consists of four contiguous chunks of equal size, with each chunk representing the top half of one channel of the tensor.
The initial offset and axis are set to 0, and the element size is 1 byte.
The check for contiguity in line 3 fails, because axis 0 is not contiguous. An axis is considered contiguous if the product of all the dimension sizes, starting at the current axis + 1 and ending at the rank of the tensor, is equal to the stride of the current dimension. For axis 0, this product is equal to 64 × 128 = 8192, which is different from the stride of this dimension (16384).
Entering the loop in line 8, axis 0 is not the last axis, so we move on to line 13, where we perform a recursive call to RenderChunks with the axis increased to 1. In this case, axis 1 is contiguous (the size of the next dimension is equal to 128, as is the stride of dimension 1), so RenderChunks returns [0] for offsets and [8192] for sizes.
The following condition in line 14 checks whether the next dimension is contiguous and the returned memory chunk starts immediately after the last rendered chunk ends; if so, it merges the two chunks. Otherwise, the returned chunk is appended to the result.
In line 21, we set the offset to 16384 and continue through the loop three more times, ultimately returning the sizes and offsets of the contiguous memory chunks: [8192, 8192, 8192, 8192] and [0, 16384, 32768, 49152].
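The outcome of this walk-through can be reproduced with the following sketch. It is not the recursive RenderChunks listing itself: it is an iterative variant written for illustration, and it assumes strides expressed in elements, a densely packed innermost suffix of dimensions, and outer strides that grow from the last axis to the first. The function name `render_chunks` and its parameters are hypothetical.

```python
from itertools import product


def render_chunks(shape, strides, elem_size=1, base_offset=0):
    """Offsets and sizes (in bytes) of a tile's contiguous chunks, by increasing offset."""
    # Find the longest suffix of dimensions that is densely packed in memory.
    axis = len(shape)
    run = elem_size  # bytes covered by one dense block
    while axis > 0 and strides[axis - 1] * elem_size == run:
        run *= shape[axis - 1]
        axis -= 1
    offsets, sizes = [], []
    # Every index combination of the remaining outer dimensions starts one block.
    for idx in product(*(range(s) for s in shape[:axis])):
        off = base_offset + sum(i * st for i, st in zip(idx, strides[:axis])) * elem_size
        if offsets and offsets[-1] + sizes[-1] == off:
            sizes[-1] += run  # block is adjacent to the previous chunk: merge
        else:
            offsets.append(off)
            sizes.append(run)
    return offsets, sizes


offsets, sizes = render_chunks((4, 64, 128), (16384, 128, 1))
print(offsets, sizes)
```

For the tile considered in this section, the sketch yields four chunks of 8192 bytes at offsets 0, 16384, 32768, and 49152, in agreement with the walk-through.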

2) ALGORITHM COLLISIONSIZE
In this section, the execution of the CollisionSize algorithm is described. Consider a tensor of size 4 × 128 × 128 allocated at memory offset 0 and its tile t1 of size 4 × 64 × 128 at offset 0 in each dimension. Consider another tensor of the same size (4 × 128 × 128) allocated at memory offset 128, and its tile t2 of size 4 × 64 × 128 at offset 0. The CollisionSize algorithm returns the size of the overlapping memory region for the two tiles.
We start with tiles t1 and t2, and set t1base to 0 and t2base to 128. The checks in lines 1 and 2 fail because t1 lies neither completely to the left nor to the right of t2. This can be verified by introducing the notion of tile span, which is the distance between the first and last elements of the tile. The span of t1 and t2 is 57344 bytes. We can see that t2base + offset(t2) < t1base + offset(t1) + span(t1), implying that t1 does not lie completely to the left of t2, and t1base + offset(t1) < t2base + offset(t2) + span(t2), implying that t1 does not lie completely to the right of t2.
Next, we initialize the indices for iterating through chunks of t1 and t2 (t1idx and t2idx) to 0. We then use the RenderChunks function to calculate the offsets and sizes of the chunks for both tiles. For t1, RenderChunks with the shape (4, 64, 128) and strides (16384, 128, 1) results in offsets [0, 16384, 32768, 49152] and sizes [8192, 8192, 8192, 8192] bytes for each of the four chunks. For t2, the calculation is similar. Using these offsets and sizes, we enter the main loop of the CollisionSize algorithm. The loop iterates over the chunks of t1 and t2 to check for overlap.
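The scan described in this section can be sketched as a two-pointer walk over the sorted chunk lists. The code below is an illustrative approximation, not the CollisionSize listing itself: the function name, its signature, and the rule for advancing the indices are assumptions, and the chunk lists are written out literally so that the block is self-contained.

```python
def collision_size(chunks_a, base_a, chunks_b, base_b):
    """Size in bytes of the first overlapping region of two tiles, or 0 if none."""
    offs_a, sizes_a = chunks_a
    offs_b, sizes_b = chunks_b
    if not offs_a or not offs_b:
        return 0
    # Quick rejection: one tile lies completely to the left or right of the other.
    a_lo, a_hi = base_a + offs_a[0], base_a + offs_a[-1] + sizes_a[-1]
    b_lo, b_hi = base_b + offs_b[0], base_b + offs_b[-1] + sizes_b[-1]
    if a_hi <= b_lo or b_hi <= a_lo:
        return 0
    i = j = 0
    while i < len(offs_a) and j < len(offs_b):
        a_start = base_a + offs_a[i]
        a_end = a_start + sizes_a[i]
        b_start = base_b + offs_b[j]
        b_end = b_start + sizes_b[j]
        lo, hi = max(a_start, b_start), min(a_end, b_end)
        if lo < hi:
            return hi - lo
        # Advance past the chunk that ends first; it cannot overlap anything later.
        if a_end <= b_end:
            i += 1
        else:
            j += 1
    return 0


# Chunks of a 4 x 64 x 128 tile of a 4 x 128 x 128 tensor (1-byte elements).
chunks = ([0, 16384, 32768, 49152], [8192, 8192, 8192, 8192])
print(collision_size(chunks, 0, chunks, 128))   # tensors at offsets 0 and 128
print(collision_size(chunks, 0, chunks, 8192))  # second tensor shifted by one chunk
```

With the second tensor at offset 128, this sketch reports an overlap of 8064 bytes between the first chunks of the two tiles. With the second tensor at offset 8192, the chunks of the two tiles interleave and no collision is found, which is the situation the allocator tries to exploit.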

3) ALGORITHM TENSORALLOCATE
In this section, we provide a line-by-line execution example of the TensorAllocate algorithm. For the sake of example, let us consider a neural network consisting of one or several unary element-wise layers. The output size of such a network is always equal to its input size. Let us assume that input tensor I of size 4 × 128 × 128 is processed in four tiles of size 1 × 128 × 128, and is allocated at a memory offset of 0. The lifetime of tile i1 is [0, 1], that of tile i2 is [0, 2], that of i3 is [0, 3], and that of i4 is [0, 4]. The lifetime of tensor I is set to [0, 0]. Tile i1 is loaded onto the NPU and processed, and the result is returned to the off-chip memory. This result is the first tile of the output tensor O, o1, for which we want to find an offset; its lifetime is [2, 5]. We use the TensorAllocate algorithm to find the offset of tensor O.
The lifetime of tensor O is [5, 5], which does not overlap with the lifetime of I. This implies that we can choose offset 0 as the first suitable address for tensor O. Therefore, we set addr to 0. The algorithm enters the while loop in line 3. After setting the overlap to zero in line 4, we enter the loop over all tiles of O, starting with o1, and the loop over all tiles of I, starting with i1 immediately after. The lifetime of o1, [2, 5], does not overlap with the lifetime of i1, [0, 1]; therefore, we continue to i2. Tile i2 has an offset of 16384, which means that it lies completely to the right of o1, so even though the lifetimes of o1 and i2 overlap, the CollisionSize algorithm returns 0 for tile i2. The same argument applies to tiles i3 and i4. The process is repeated for all tiles of O, confirming that there is no overlap between tensors I and O, and the algorithm returns an offset of 0 for tensor O.
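The example above can be replayed with a compact sketch of the allocation loop. Several simplifications are assumed here and are not taken from the TensorAllocate listing: a candidate address is advanced directly by the overlap size instead of searching for the next free region, each tile is a single 16384-byte chunk, and the lifetimes of o2, o3, and o4 are hypothetical values chosen for illustration, since only the lifetime of o1 is stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    chunks: tuple   # ((offset, size), ...) relative to the tensor start, sorted by offset
    first_use: int  # inclusive
    last_use: int   # inclusive


def first_overlap(a, base_a, b, base_b):
    """Size of the first overlapping region between the chunks of two tiles, or 0."""
    i = j = 0
    while i < len(a.chunks) and j < len(b.chunks):
        a_start = base_a + a.chunks[i][0]
        a_end = a_start + a.chunks[i][1]
        b_start = base_b + b.chunks[j][0]
        b_end = b_start + b.chunks[j][1]
        lo, hi = max(a_start, b_start), min(a_end, b_end)
        if lo < hi:
            return hi - lo
        if a_end <= b_end:
            i += 1
        else:
            j += 1
    return 0


def tensor_allocate(new_tiles, allocated, start=0):
    """allocated is a list of (tensor_base, Tile); returns a conflict-free base address."""
    addr = start
    while True:
        overlap = 0
        for tile in new_tiles:
            for base, peer in allocated:
                # Only peers (tiles with intersecting lifetimes) can conflict.
                if tile.first_use <= peer.last_use and peer.first_use <= tile.last_use:
                    overlap = first_overlap(tile, addr, peer, base)
                    if overlap:
                        break
            if overlap:
                break
        if overlap == 0:
            return addr
        addr += overlap  # shift the whole tensor past the conflicting region


CHUNK = 16384
tiles_I = [Tile(((k * CHUNK, CHUNK),), 0, k + 1) for k in range(4)]
allocated = [(0, t) for t in tiles_I]
# o1 lives during [2, 5]; the later start times are assumed for this example.
tiles_O = [Tile(((k * CHUNK, CHUNK),), k + 2, 5) for k in range(4)]
print(tensor_allocate(tiles_O, allocated))
```

Under these assumptions the sketch places O at offset 0, as in the walk-through. If every tile of O were live from time 0, the same loop would shift O past the whole of I.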

IV. EXPERIMENTAL RESULTS
In this section, we present a comparison between the TAA approach and the algorithms described in [18]. In [18], the authors used shared memory buffers and attempted to allocate sorted tensors to these shared buffers to minimize the peak memory consumption. Three main strategies were used in this Shared Objects algorithm: Greedy by Breadth, Greedy by Size, and Greedy by Size Improved. The Greedy by Breadth strategy takes into account the total size of all peers for each tensor. The Greedy by Size strategy sorts tensors based on their memory size. The Greedy by Size Improved strategy sorts tensors according to their memory sizes by splitting them into levels. For each model, we used all three Shared Objects approaches from [18] and selected the one that produces the packing with the least peak off-chip memory.
In Table 1, we report the peak memory usage by the intermediate tensors, and in Table 2, the total time taken by Algorithm 1. All models were quantized to eight bits using the open-source neural network compiler ONE [41]. For model compilation, we used a single workstation with an AMD Ryzen Threadripper 2950X 16-Core processor.
In the remainder of this section, we analyze the packing produced by TAA for some of the models, identify peak memory usage points, and show in detail how TAA is able to reduce the peak memory usage. In Fig. 3a (left), we plot the packing produced by the baseline Shared Objects algorithm for SqueezeNet (the peak memory is marked by the purple dotted line and zoomed in Fig. 3b).
Table 3 lists the lifetimes, shapes, and strides of the tensors. Because of the intersecting lifetimes of these tensors, the Shared Objects algorithm allocated a separate object for each of them. The TAA assigned the same offset of zero to all of these tensors. Fig. 3b (right) shows that, even though these tensors have overlapping lifetimes, decomposing these lifetimes into separate lifetimes for each tile makes it possible to allocate these tensors at the same offset with no data corruption.
Our algorithm may be ineffective for several reasons. When peak memory usage occurs owing to tensors with incompatible tile shapes, tile intersections may occur at any base tensor offset. An example of this is Inception V3, whose packing and peak memory usage, as determined by the TAA algorithm, are shown in Fig. 3c. Another reason for the ineffectiveness of TAA could be that peak memory usage occurs because of a single large feature map that dominates the other intermediate tensors. This is the case with MobileNet V2, whose input is much larger than the other tensors in the off-chip memory, as shown in Fig. 3d.

V. CONCLUSION
In this paper, we presented a novel approach to tensor allocation that utilizes the restrictions of the Deep Learning model execution environment with a limited amount of on-chip memory. We evaluated this approach on various popular model architectures and showed that, by reducing the lifetime of a tensor in external memory and considering tiles as constraints on memory locations for this tensor, it is possible to significantly reduce the peak external memory usage of these models. We note that, while this approach increases the model compilation time, this increase is not significant for the models considered.

FIGURE 1. Memory layout of tiles of a 128 × 128 tensor in row-major format: a) tile size is 128 × 32; b) tile size is 128 × 64; c) tile size is 64 × 128.

FIGURE 2. Example of tensor allocation in time and memory. a) Allocation of one tensor; b) Fitting the output of an arbitrary neural network layer into the unoccupied space; c) Resolving a collision.

Algorithm 1 TensorAllocate
1 overlap ← -1
2 addr ← start of first memory region of sufficient size
3 while overlap ≠ 0 do
…
19 addr ← first memory region of sufficient size, starting with addr + overlap
20 end

FIGURE 3. Model baseline packing and TAA peak memory graph for different

TABLE 1. Peak external memory usage by intermediate tensors.

TABLE 2. Phase duration of the intermediate tensors allocation.

TABLE 3. Analysis of SqueezeNet peak memory usage.