An Energy-Efficient Edge Computing Paradigm for Convolution-based Image Upsampling

A novel energy-efficient edge computing paradigm is proposed for real-time deep learning-based image upsampling applications. State-of-the-art deep learning solutions for image upsampling are currently trained using either resize or sub-pixel convolution to learn kernels that generate high fidelity images with minimal artifacts. However, performing inference with these learned convolution kernels requires memory-intensive feature map transformations that dominate time and energy costs in real-time applications. To alleviate this pressure on memory bandwidth, we confine the use of resize or sub-pixel convolution to training in the cloud by transforming learned convolution kernels to deconvolution kernels before deploying them for inference as a functionally equivalent deconvolution. These kernel transformations, intended as a one-time cost when shifting from training to inference, enable a systems designer to use each algorithm in their optimal context by preserving the image fidelity learned when training in the cloud while minimizing data transfer penalties during inference at the edge. We also explore existing variants of deconvolution inference algorithms and introduce a novel variant for consideration. We analyze and compare the inference properties of convolution-based upsampling algorithms using a quantitative model of incurred time and energy costs and show that using deconvolution for inference at the edge improves both system latency and energy efficiency when compared to their sub-pixel or resize convolution counterparts.


Introduction
When building deep learning solutions for latency-sensitive image upsampling problems such as real-time super resolution, systems designers are forced to balance trade-offs between image fidelity and hardware performance. Models trained using resize convolution [20] learn to upsample images without introducing checkerboard artifacts, but rely on memory-intensive pre-processing to then inefficiently execute compute operations in a higher dimensional space where the cost is greater [11,23]. Models trained with sub-pixel convolution [23] converge faster with less test error when properly initialized [2], but require even more memory-intensive post-processing with every inference pass. Using deconvolution [31], a model can efficiently generate images without any additional data processing, but training can introduce checkerboard artifacts with gradient updates [2,20]. We propose a novel edge computing paradigm that eases the selection between these algorithms by using each in their optimal context. As depicted in Figure 1, our framework confines the use of sub-pixel or resize convolution to training in the cloud, where the cost of their memory-intensive feature map transformations is less severe. The learned convolution kernels are then transformed to deconvolution kernels, effectively reducing data transfer penalties without sacrificing image fidelity when deployed for energy-efficient inference at the edge as a functionally equivalent deconvolution.
Figure 1: In our design paradigm (left), we introduce the blocks highlighted in yellow to use convolution-based image upsampling algorithms in their optimal context, effectively minimizing data transfer penalties when inferencing at the edge while preserving the image fidelity learned when training in the cloud. Standard edge computing frameworks (right) deploy pre-trained networks to inference locally on edge devices using the same high-level algorithms used in training, i.e. an identity mapping from training to inference.

The goal of this paper is to synthesize a collection of previous works into an edge computing design methodology to support real-time convolution-based image upsampling applications. In such applications, inference at the edge is physically separated from cloud-based training. This cloud-to-edge separation of hardware is currently standard practice and reduces strain on network bandwidth, decreases system latency, and improves overall security [9]. As shown in Figure 1, standard edge computing frameworks deploy pre-trained neural networks to execute locally on edge devices, and the high-level algorithms used during training are the same as those used during inference. During inference, these high-level algorithms are executed directly as latency-optimized compute kernels, often selected by a runtime optimizer without explicit directive from a systems designer. We refer to these compute kernels as low-level algorithms so as not to confuse them with learned convolution weights, which are also referred to as kernels. Our proposed edge computing paradigm enables the use of low-level deconvolution algorithms as inference solutions for real-time image upsampling applications without sacrificing image fidelity. Under our paradigm, the high-level algorithms used for training are not the same as those used for inference. Decoupling the two significantly reduces data transfer penalties, improving both latency and energy efficiency during inference at the edge. Below, we summarize our contributions.
1. We enable the physical separation of high-level training algorithms from high-level inference algorithms by introducing a set of single-use kernel transformations that translate sub-pixel and nearest neighbor resize convolutions to functionally equivalent deconvolutions (Section 4).
2. We provide a comprehensive analysis of existing formulations of deconvolution inference solutions and explore their use as low-level algorithms at the edge (Section 2.3).
3. We introduce a novel variant of the reverse looping deconvolution algorithm [33] that exposes more opportunities for concurrent execution and improves its adaptability to the limited resource availability of edge platforms (Section 3).
4. We analyze and compare the properties of these algorithms under a quantitative model to verify our design paradigm to show that translating to deconvolution for inference at the edge from sub-pixel or resize convolution in the cloud reduces time and energy costs (Section 5).
5. We summarize the implications of these experiments and provide recommendations for system designers to support real-time, energy-efficient convolution-based upsampling (Section 6).

Convolution-based Image Upsampling Algorithms
Many solutions to important computer vision and image processing applications require increasing the number of pixels per unit area (or resolution) by inferring values in high dimensional spaces from low dimensional representations, e.g. scene segmentation [17], pose estimation [25], image generation [13], or super resolution [23]. This process, commonly referred to as upsampling, is a one-to-many mapping to predict, generate, or recover information to increase dimensionality [23,27]. In contrast, downsampling is a many-to-one mapping used to encode features and reduce dimensionality [27]. Deep learning upsampling frameworks typically rely on convolution-based and/or interpolation-based algorithms to increase resolution:
• Interpolation-based algorithms infer the value of pixels for which there are no sample points using only local information such as nearest pixel value. This class of techniques includes algorithms such as nearest neighbor and bilinear interpolation.
• Convolution-based upsampling algorithms also infer the value of pixels for which there are no sample points but predict, generate, or recover high frequency information by learning spatial correlations through training strategies [23]. This class of techniques includes algorithms such as deconvolution, sub-pixel convolution, and resize convolution, which are commonly used in end-to-end deep learning solutions [2,10,11,23,20,27].
The majority of state-of-the-art deep learning solutions for image upsampling rely on either sub-pixel or resize convolution [2,8,10,20,23,34]. However, these algorithms require memory-intensive feature map transformations at each inference pass. In Section 4, we introduce kernel transformations that exploit the functional equivalence of sub-pixel and resize convolution to deconvolution.

Sub-Pixel Convolution
The sub-pixel convolution was introduced by Shi et al. [23] to upsample images using a fully convolutional neural network and is used as the standard method for upsampling images in deep learning solutions [28,27]. As shown in Figure 3, the algorithm is executed as two serialized operations: (1) a same-padded convolution followed by (2) a pixel shuffle. However, convolution (see Algorithm 1) is inherently a downsampling operation. To upsample by a factor of r, the sub-pixel convolution first generates $r^2$ more output channels using a same-padded convolution to then feed the resulting output into the pixel shuffle (see Algorithm 2) to be reshaped. Aitken et al. [2] show that, when properly initialized, a network trained using the sub-pixel convolution converges faster with lower image reconstruction error than other convolution-based upsampling algorithms. However, the sub-pixel convolution requires pixel shuffle post-processing for every inference pass. As further discussed in Section 5, this memory-intensive feature map transformation severely limits energy efficiency.
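The pixel shuffle step can be illustrated with a minimal sketch for a single output channel. This is not a reproduction of Algorithm 2: the function name `pixel_shuffle` and the channel ordering (channel a*r + b supplies sub-pixel offset (a, b), as in PyTorch's `PixelShuffle`) are assumptions made for illustration.

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r low-resolution maps (each H x W) into one (H*r) x (W*r) map.

    Assumption: channel index a*r + b feeds sub-pixel offset (a, b).
    """
    assert len(channels) == r * r
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0] * (w * r) for _ in range(h * r)]
    for a in range(r):
        for b in range(r):
            src = channels[a * r + b]
            for i in range(h):
                for j in range(w):
                    # Every output element is a pure data move, no arithmetic.
                    out[i * r + a][j * r + b] = src[i][j]
    return out


# Four 2 x 2 maps (r = 2); values encode (channel, row, col) for readability.
lr = [[[c * 10 + i * 2 + j for j in range(2)] for i in range(2)] for c in range(4)]
hr = pixel_shuffle(lr, 2)
for row in hr:
    print(row)
```

Every element of the output is written by a copy rather than a computation, which is why the operation is dominated by memory traffic.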

Resize Convolution
The resize convolution was introduced by Dong et al. [10] and popularized by Odena et al. [20] to address checkerboard artifacts that can arise during training. As depicted in Figure 4, the resize convolution is executed as two serialized operations, similar to the sub-pixel convolution. To upsample by a factor of r, the resize convolution first uses (1) an interpolation-based upsampling algorithm before applying (2) a same-padded convolution in the higher resolution space [20]. Standard implementations for resize convolution rely on nearest neighbor (NN) interpolation as opposed to bilinear or bicubic [20,27]. Similar to sub-pixel convolution's pixel shuffle, the interpolation-based pre-processing is a memory-dominated algorithm required for every inference pass. Unlike the sub-pixel convolution, the same-padded convolution is executed in a higher dimensional space where the cost of time and energy is much greater [11,23]. As further discussed in Section 5, this severely limits hardware performance at inference.

Figure 3: Sub-pixel Convolution. In this example, the 2 × 2 input (in blue) is convolved with a 3 × 3 kernel (in green) using a stride S = 1 and padding P = 1 to create the 2 × 2 output (in red) before upsampling the image by a factor of 2 using the pixel shuffle (PS). The pixel shuffle is a memory-intensive post-processing operation that introduces significant overhead when used for single-batch inference (Section 5).

Algorithm 1 Standard Convolution
    for i_c = 0, i_c++, while i_c < I_C do
        for k_h = 0, k_h++, while k_h < K do
            for k_w = 0, k_w++, while k_w < K do

Figure 4: Nearest Neighbor Resize Convolution. In this example, the 2 × 2 input is first upsampled by a factor of 2 using NN interpolation, then convolved with a 3 × 3 kernel (in green) in a higher dimensional space to create the 4 × 4 output (in red). NN interpolation is a memory-intensive pre-processing operation that introduces significant overhead when used for single-batch inference (Section 5).
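As a minimal sketch of the pre-processing step described above, nearest neighbor interpolation can be written as pure pixel replication. The helper name `nn_upsample` is hypothetical and the example handles a single channel only.

```python
def nn_upsample(image, r):
    """Nearest neighbor interpolation: replicate every pixel into an r x r block."""
    return [[image[i // r][j // r] for j in range(len(image[0]) * r)]
            for i in range(len(image) * r)]


x = [[1, 2],
     [3, 4]]
up = nn_upsample(x, 2)
for row in up:
    print(row)
```

The same-padded convolution that follows then operates on this r-times larger map, so its cost grows with the square of the upsampling factor.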

Deconvolution
Deconvolution (also referred to as transpose convolution) is an end-to-end learnable upsampling algorithm that inherently increases the resolution of an input image [12,31]. Deconvolution was first popularized by Zeiler et al. [31] to visualize the latent representations of convolutional neural networks and has since gained popularity in deep learning image upsampling solutions [27]. Unlike other convolution-based upsampling algorithms, deconvolution upsamples the image directly in one operation. A common misconception of the deconvolution operation is that it requires inserting zeros to perform fractionally strided convolutions. While it is possible to execute deconvolution this way, it greatly increases the input feature map size by adding redundant zero-valued operations, resulting in a much less efficient implementation [12]. We discuss this formulation further in Section 2.3.2. As shown in Figure 5, the standard deconvolution algorithm given by Algorithm 3 strides over the input space, creating overlapping sums in the output space when the stride S is smaller than kernel K [5,7,12,33]. In the context of training, these overlapping sums have been shown to introduce checkerboard artifacts as a result of gradient updates [20]. In the context of inference, accumulating over these overlapping regions requires storing partial sums when K > S and leads to communication overhead, complex dataflow, and increased resource utilization via on-chip buffering [5,7,16,33]. Algorithmic approaches to work around this overlapping sums problem are divided into three categories: reverse looping deconvolution, fractionally strided deconvolution, and transforming deconvolution to convolution.

Figure 5: Standard deconvolution strides over the input space, creating overlapping sums in the output space when the stride S is smaller than kernel K [5,7,12,33]. In this example, the 2 × 2 input (blue) is deconvolved with a 4 × 4 kernel (green) using a stride S = 2 and a padding P = 1 to create the 4 × 4 output (red).
Algorithm 3 Standard Deconvolution
    for i_w = 0, i_w++, while i_w < I_W do
        for i_h = 0, i_h++, while i_h < I_H do
            for i_c = 0, i_c++, while i_c < I_C do
                for k_h = 0, k_h++, while k_h < K do
                    for k_w = 0, k_w++, while k_w < K do
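The input-space traversal and the resulting overlapping sums can be sketched for a single channel as follows. This is an illustrative reading of the loop structure above, not the paper's full Algorithm 3: the function name `deconv2d` is hypothetical, channels are omitted, and the output size is assumed to follow the usual relation O = S(I − 1) + K − 2P.

```python
def deconv2d(x, w, stride, pad):
    """Standard (input-centric) deconvolution for one channel.

    Each input pixel scatters a K x K patch into the output. Patches overlap
    whenever K > stride, so output entries are accumulated as partial sums.
    """
    ih, iw, k = len(x), len(x[0]), len(w)
    oh = stride * (ih - 1) + k - 2 * pad
    ow = stride * (iw - 1) + k - 2 * pad
    out = [[0] * ow for _ in range(oh)]
    for i in range(ih):
        for j in range(iw):
            for kh in range(k):
                for kw in range(k):
                    o_r = stride * i + kh - pad
                    o_c = stride * j + kw - pad
                    if 0 <= o_r < oh and 0 <= o_c < ow:
                        out[o_r][o_c] += x[i][j] * w[kh][kw]  # overlapping sum
    return out


# Same shape as the Figure 5 example: 2 x 2 input, 4 x 4 kernel, S = 2, P = 1.
x = [[1, 2], [3, 4]]
w = [[1] * 4 for _ in range(4)]
y = deconv2d(x, w, stride=2, pad=1)
for row in y:
    print(row)
```

With an all-ones kernel, each output value is simply the sum of the input pixels whose patches cover it, which makes the overlap pattern visible in the printed result.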

Reverse Looping Deconvolution
Zhang et al. [33] were the first to propose an efficient deconvolution inference solution for the overlapping sums problem. Using reverse looping and stride-hole skipping techniques, they traverse the output space rather than the input space to expose opportunities for concurrent execution. The reverse looping deconvolution algorithm (REVD) skips over the output space in $S^2$ independent tiles to be computed concurrently. The resulting algorithm (see Algorithm 4) relies on expensive modulo arithmetic for address calculations. Observing that the modulo arithmetic needed to calculate the output pixel address $o_h$ depends only on the kernel address $k_h$, Colbert et al. [7] minimize its impact by pre-computing and caching these offsets for each value of $k_h$. Assuming square kernels, the process reduces the number of modulo operations to 2K, which minimizes resource utilization and on-chip memory as K tends to be small. In Section 3, we propose a variant of this algorithm without the use of stride-hole skipping to further expose opportunities for concurrent execution.

Fractionally Strided Deconvolution
The fractionally strided deconvolution (STRD) can be implemented using unmodified convolution accelerators and is most commonly used by machine learning frameworks such as PyTorch [21] and TensorFlow [1]. It avoids the overlapping sum problem by inserting S − 1 zeros between adjacent input pixels before executing a standard convolution using transposed kernels [12]. As such, it can be viewed as two serialized operations: (1) a zero-insertion feature map transformation followed by (2) a same-padded transposed convolution. As shown in Figure 6, to upsample by a factor of r, the fractionally strided deconvolution first inserts H − 1 rows and W − 1 columns of S − 1 zeros into the input feature map [12] before transposing the deconvolution kernels to execute a same-padded convolution. While this works around the overlapping sum problem, it introduces massive redundancies as the input feature map grows [12]. Figure 7a shows how the percentage of zero-valued input pixels increases with upsampling factor r.

Figure 6: For the fractionally strided deconvolution to be equivalent to the example in Figure 5, S − 1 zeros are first inserted in between the input pixels. Then, the transpose of the 4 × 4 deconvolution kernels (green) are convolved over the 3 × 3 input (blue) with a stride of S = 1 and padding P = 2 to create the 4 × 4 output (red).
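The growth of zero-valued inputs can be estimated with a short calculation. The sketch below assumes a square H × H input with r − 1 zeros inserted between adjacent pixels and does not count border padding; the function name `strd_zero_fraction` is hypothetical, and Figure 7a may use a different counting convention.

```python
def strd_zero_fraction(h, r):
    """Fraction of zero-valued pixels after zero insertion on a square h x h input.

    Assumption: r - 1 zeros go between adjacent pixels, border padding excluded.
    """
    side = h + (h - 1) * (r - 1)   # side length after zero insertion
    return 1.0 - (h * h) / (side * side)


for r in (2, 3, 4):
    print(r, round(strd_zero_fraction(1024, r), 4))
```

For large inputs the fraction approaches 1 − 1/r², which is consistent with roughly three quarters of the inputs being zero at r = 2.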

Transforming Deconvolution to Convolution
Chang et al. [5] avoid the overlapping sum problem by transforming deconvolution into $S^2$ tiled convolutions to compute each region of the output feature map independently. They refer to this formulation as transforming deconvolution to convolution (TDC). To split a deconvolution into $S^2$ convolutions, the algorithm first uses the transformation given by Algorithm 5 to split the deconvolution kernels into $S^2$ tiles of size $K_T \times K_T$, where $K_T = \lceil K/S \rceil$. When K is not evenly divisible by S, this transformation requires padding each kernel tile by $P_K$, where $P_K = (S \times K_T) - K$. Figure 7b shows how the percentage of zero-valued kernels increases with upsampling factor r. Given the transformed kernels, the output pixels of each of the $S^2$ tiles can then be computed concurrently using a same-padded convolution. However, similar to the sub-pixel convolution, the transformation process requires expensive post-processing to stitch the resulting tiles back together [4,5,30]. Xu et al. [30] propose a variant that stitches the output tiles during computation, which reduces the cost of data transfers by integrating post-processing arithmetic into the logic of the base algorithm. In our algorithm comparisons, we focus on this variant of TDC.

Figure 7: Implementing deconvolution as a fractionally strided deconvolution (STRD), as is common practice, greatly increases the input feature map size by adding redundant zero-valued operations [12]. Assuming a square input feature map, even upsampling by a factor of 2 requires 75% of input pixels to be zero. As shown in blue, transforming deconvolution to a convolution (TDC) requires padding the transformed kernel tiles by $P_K$ when the kernel size K is not evenly divisible by the stride S. Assuming a square input feature map and $P_K = 2$, the percentage of zeros increases with upsampling factor r.
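The tile size and kernel padding used by TDC follow directly from K and S. The sketch below assumes the tile size is the ceiling of K/S, which is the reading that keeps the padding $P_K$ non-negative; the helper name `tdc_tile` is hypothetical.

```python
def tdc_tile(k, s):
    """Return (K_T, P_K): per-tile kernel size and the zero padding needed
    so that S tiles of width K_T cover a K-wide kernel."""
    k_t = -(-k // s)          # integer ceiling of k / s
    return k_t, s * k_t - k


for k, s in [(4, 2), (3, 2), (5, 3), (6, 4)]:
    print((k, s), tdc_tile(k, s))
```

No padding is needed when S divides K, as in the 4 × 4 kernel with stride 2 used in Figure 5.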

Improving Reverse Looping Deconvolution
When the memory requirements of a deep learning model exceed the resources available on an edge device, the inference pass is typically divided into smaller workloads using tiling [18,32]. These tiled workloads are data-independent if they each write to distinct memory locations without overlap. Algorithms that can be divided into smaller tiles have higher degrees of parallelism, as each data-independent workload can be executed concurrently across SIMD lanes on multi-threaded hardware. When data-independent workloads are evenly balanced, tiling optimizations can increase hardware utilization and lead to improved energy efficiency. On the other hand, the presence of imbalanced workloads can force all live processes to wait for an overloaded lane to finish before synchronization [22]. Algorithms with higher degrees of parallelism expose more opportunities for efficient concurrent execution as workloads are more easily balanced across SIMD lanes.
To understand the impact of data-independence and load balance on inference acceleration, consider a processor with 16 SIMD lanes designed to accelerate a deconvolution workload. When used to upsample a 14 × 14 image by a factor of 2 using a stride of 2 such that r = 2, S = 2, and $O_H \times O_W$ = 28 × 28, a designer could use a tile size of 7 to divide the total work into 16 data-independent workloads to be dispatched across each lane and achieve 100% load balance. However, both the reverse looping deconvolution (REVD) and transforming deconvolution to convolution (TDC) algorithms traverse the output space in $S^2$ tiles [5,33]. As a consequence, functional correctness breaks down when the tiling along the output space is not perfectly divisible by the stride S. A tiling factor of 7 is not perfectly divisible by the stride of 2, so a designer is left with options 6 and 8. Using a tiling factor of 8 requires zero-padding each workload, effectively increasing the data movement by 30%. This 1.3x increase is detrimental to the system's energy efficiency as energy consumption is dominated by data movement [14]. Using a tiling factor of 6 requires multiplexing through the 16 SIMD lanes twice, reducing hardware utilization to 78%, increasing system latency, and introducing imbalanced workloads that would need to wait for each lane to finish. To both circumvent the overlapping sums problem and fully exploit data-independent concurrent execution, a deconvolution algorithm needs to traverse the output space without the use of stride-hole skipping.
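The utilization and data movement figures in this example can be reproduced with a few lines of arithmetic. The helper `tiling_stats` is hypothetical, and its data ratio assumes every tile is zero-padded to the full tile size, which the text states only for the tile size of 8.

```python
def tiling_stats(out_size, tile, lanes):
    """Return (tiles, passes, utilization, data_ratio) for square tiling.

    data_ratio assumes all tiles are padded to full size (an assumption).
    """
    per_side = -(-out_size // tile)           # integer ceiling
    tiles = per_side ** 2
    passes = -(-tiles // lanes)               # trips through the SIMD lanes
    utilization = tiles / (passes * lanes)
    data_ratio = (per_side * tile) ** 2 / out_size ** 2
    return tiles, passes, utilization, data_ratio


for tile in (6, 7, 8):
    print(tile, tiling_stats(28, tile, 16))
```

A tile size of 7 yields 16 tiles in a single pass, a tile size of 6 yields 25 tiles over two passes (25/32 ≈ 78% utilization), and a tile size of 8 moves 1024/784 ≈ 1.3x the data.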
We propose an improved reverse looping deconvolution algorithm, which we refer to as REVD2. As shown in Algorithm 6, REVD2 uses stride-hole skipping along the weight space rather than the output space to both avoid the overlapping sums problem and fully expose the data-independence of output pixels for more effective load balancing. Unlike TDC or REVD, REVD2 supports a tile size of 7 in the example described above. It also reduces the cost of modulo arithmetic. We can fully remove any dependence of REVD2 on modulo arithmetic by leveraging the data-independence of each output pixel. When dispatching each tiled workload for concurrent execution, $\mathrm{mod}(o_h + P, S)$ can be replaced by a simple counter j initialized to P. When j ≥ S, it can be reset to zero.
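The idea of traversing the output space while skipping stride holes along the kernel axes can be sketched as below. This is one possible reading of the description above for a single channel, not a reproduction of Algorithm 6. The function names are hypothetical, and the counters are initialized with a one-time `pad % stride` so that cases with P ≥ S are also handled, which is an assumption of this sketch. A scatter-style reference is included only to check the result.

```python
def deconv2d_scatter(x, w, stride, pad):
    """Reference: input-centric deconvolution with overlapping sums."""
    ih, iw, k = len(x), len(x[0]), len(w)
    oh = stride * (ih - 1) + k - 2 * pad
    ow = stride * (iw - 1) + k - 2 * pad
    out = [[0] * ow for _ in range(oh)]
    for i in range(ih):
        for j in range(iw):
            for kh in range(k):
                for kw in range(k):
                    o_r, o_c = stride * i + kh - pad, stride * j + kw - pad
                    if 0 <= o_r < oh and 0 <= o_c < ow:
                        out[o_r][o_c] += x[i][j] * w[kh][kw]
    return out


def deconv2d_output_centric(x, w, stride, pad):
    """Loop over output pixels; each one is written exactly once."""
    ih, iw, k = len(x), len(x[0]), len(w)
    oh = stride * (ih - 1) + k - 2 * pad
    ow = stride * (iw - 1) + k - 2 * pad
    out = [[0] * ow for _ in range(oh)]
    jr = pad % stride                      # tracks (o_r + pad) % stride
    for o_r in range(oh):
        jc = pad % stride                  # tracks (o_c + pad) % stride
        for o_c in range(ow):
            acc = 0
            for kh in range(jr, k, stride):        # skip stride holes in the kernel
                i = (o_r + pad - kh) // stride
                if not 0 <= i < ih:
                    continue
                for kw in range(jc, k, stride):
                    j = (o_c + pad - kw) // stride
                    if 0 <= j < iw:
                        acc += x[i][j] * w[kh][kw]
            out[o_r][o_c] = acc
            jc = jc + 1 if jc + 1 < stride else 0  # counter instead of modulo
        jr = jr + 1 if jr + 1 < stride else 0
    return out


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[4 * a + b + 1 for b in range(4)] for a in range(4)]
print(deconv2d_output_centric(x, w, 2, 1) == deconv2d_scatter(x, w, 2, 1))
```

Because no two iterations write the same output location, the outer loops can be split into tiles of any size, including sizes not divisible by the stride.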

Kernel Transformations for Deconvolution Inference
State-of-the-art deep learning solutions for image upsampling are currently trained using either sub-pixel or resize convolution to learn kernels that generate high fidelity images with minimal artifacts [2,8,10,20,23,34]. However, convolution is inherently a downsampling algorithm. As discussed in Section 2, performing inference with these learned convolution kernels requires memory-intensive feature map transformations to upsample an image. Alternatively, deconvolution is inherently an upsampling algorithm. As discussed in Section 2.3, the standard deconvolution increases the resolution of an image without reliance on extraneous feature map transformations. To preserve the image fidelity learned through training while avoiding the data transfer penalties during inference, we introduce two novel kernel transformation algorithms that exploit the functional equivalence of deconvolution to these two state-of-the-art convolution-based upsampling algorithms. As opposed to the feature map transformations required for each sub-pixel or resize convolution inference pass, these kernel transformations are intended as a one-time sunk cost in software before deploying the trained model for inference and, once deployed, can be executed using any of the deconvolution formulations described in Section 2.3.
Given that functions f and g are respectively parameterized by θ and β, then $g_\beta$ is functionally equivalent to $f_\theta$ if $f_\theta(x) = g_\beta(x)$ for all valid x. As described in Section 2, both the sub-pixel and resize convolution use same-padded convolutions, which use a stride of 1. Preserving the input resolution at a stride of 1 requires K = 2P + 1, which restricts the valid convolution kernel sizes to be odd. When transforming learned convolution kernels to deconvolution kernels for inference at the edge, the functional equivalence holds for these valid kernel sizes, e.g. 3, 5, 7, 9.
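The odd-kernel constraint can be recovered in a few steps. The derivation below is a sketch that assumes the standard convolution output-size relation, using the symbols $I_H$, $O_H$, K, P, and S as they are defined elsewhere in this section.

```latex
\[
O_H = \frac{I_H + 2P - K}{S} + 1
\]
% Same-padded convolution: S = 1 and O_H = I_H
\[
I_H = I_H + 2P - K + 1
\;\Longrightarrow\;
K = 2P + 1,
\qquad
P = \frac{K - 1}{2}.
\]
```

Since P must be an integer, K is odd.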

Sub-Pixel Convolution to Deconvolution
To upsample by a factor of r, the sub-pixel convolution generates $r^2$ more output channels before applying the pixel shuffle algorithm over the output space. As shown in Figure 3, this results in a unique pattern of $r^2$ output pixels each originating from independent channels. To replicate this pattern, a functionally equivalent deconvolution uses a stride S = r with a kernel size $K_D = rK$. The deconvolution padding $P_D$ is calculated below where K and P are the convolution kernel size and padding given in Eq. 1, $I_H$ is the input height, $K_D$ is the deconvolution kernel size, S is the stride, and $O_H$ is the output height [12].
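A worked version of the padding calculation is sketched below. It assumes the standard deconvolution output-size relation and substitutes the values stated above ($O_H = rI_H$, S = r, $K_D = rK$); the closed form $P_D = rP$ is a consequence of these assumptions.

```latex
\[
O_H = S\,(I_H - 1) + K_D - 2P_D
\]
% Substitute O_H = r I_H, S = r, K_D = rK
\[
r I_H = r\,(I_H - 1) + rK - 2P_D
\;\Longrightarrow\;
P_D = \frac{r\,(K - 1)}{2} = rP.
\]
```

For example, with r = 2 and K = 3 (so P = 1), the equivalent deconvolution uses $K_D = 6$ and $P_D = 2$.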
Building from the qualitative analysis of Shi et al. [24], we introduce the weight shuffle algorithm, given by Algorithm 7. Given a sub-pixel convolution with a valid kernel size, this algorithm transforms the K × K learned convolution kernels into rK × rK deconvolution counterparts to be executed as a functionally equivalent deconvolution. As shown in Figure 8, the weight shuffle algorithm moves the learned parameters of the convolution kernels in a similar way to the pixel shuffle, but also reverses element indices as a 2D matrix transpose. Unlike the pixel shuffle algorithm, which requires a memory-intensive feature map transformation at each inference pass, the weight shuffle algorithm transforms the kernel space of a pre-trained network. It is a one-time sunk cost that can be done in software before deploying a trained model for inference.

Figure 8: Weight Shuffle. The weight shuffle algorithm moves the learned parameters of the sub-pixel convolution kernels in a similar way to the pixel shuffle algorithm, but also reverses the element indices. Unlike the pixel shuffle, this is a one-time cost in software before deploying a trained model for inference.
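The equivalence can be checked numerically with a small single-channel sketch. This is not Algorithm 7 itself: the index mapping in `weight_shuffle` is derived here from the output-size relations under the assumptions that convolution means cross-correlation (as in deep learning frameworks), that channel a*r + b supplies sub-pixel offset (a, b), and that $P_D = r(K-1)/2$. All function names are hypothetical.

```python
def conv2d_same(x, w):
    """Same-padded, stride-1 convolution (cross-correlation) for one channel."""
    h, wid, k = len(x), len(x[0]), len(w)
    p = (k - 1) // 2
    out = [[0] * wid for _ in range(h)]
    for i in range(h):
        for j in range(wid):
            for kh in range(k):
                for kw in range(k):
                    ii, jj = i + kh - p, j + kw - p
                    if 0 <= ii < h and 0 <= jj < wid:
                        out[i][j] += x[ii][jj] * w[kh][kw]
    return out


def pixel_shuffle(maps, r):
    """Interleave r*r maps; map a*r + b supplies sub-pixel offset (a, b)."""
    h, wid = len(maps[0]), len(maps[0][0])
    return [[maps[(i % r) * r + (j % r)][i // r][j // r]
             for j in range(wid * r)] for i in range(h * r)]


def deconv2d(x, w, stride, pad):
    """Standard deconvolution: scatter-accumulate over the input space."""
    ih, iw, k = len(x), len(x[0]), len(w)
    oh = stride * (ih - 1) + k - 2 * pad
    ow = stride * (iw - 1) + k - 2 * pad
    out = [[0] * ow for _ in range(oh)]
    for i in range(ih):
        for j in range(iw):
            for kh in range(k):
                for kw in range(k):
                    o_r, o_c = stride * i + kh - pad, stride * j + kw - pad
                    if 0 <= o_r < oh and 0 <= o_c < ow:
                        out[o_r][o_c] += x[i][j] * w[kh][kw]
    return out


def weight_shuffle(kernels, r):
    """Pack r*r kernels (K x K) into one rK x rK deconvolution kernel,
    reversing the kernel indices along both axes."""
    k = len(kernels[0])
    wd = [[0] * (r * k) for _ in range(r * k)]
    for a in range(r):
        for b in range(r):
            for kh in range(k):
                for kw in range(k):
                    wd[r * (k - 1 - kh) + a][r * (k - 1 - kw) + b] = \
                        kernels[a * r + b][kh][kw]
    return wd


r, k = 2, 3
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[c * 9 + 3 * kh + kw + 1 for kw in range(k)] for kh in range(k)]
           for c in range(r * r)]
sub_pixel = pixel_shuffle([conv2d_same(x, w) for w in kernels], r)
deconv = deconv2d(x, weight_shuffle(kernels, r), stride=r, pad=r * (k - 1) // 2)
print(sub_pixel == deconv)
```

The kernel rearrangement runs once on the weights, whereas the pixel shuffle it replaces would run on the activations at every inference pass.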

Resize Convolution to Deconvolution
To upsample by a factor of r, the nearest neighbor (NN) resize convolution first uses NN interpolation to increase the resolution of the image before applying a same-padded convolution over the higher dimensional output space. As shown below to reflect Figure 4, the replication of each input pixel results in a unique pattern of $r^2$ identical output pixels. Formulating the ensuing same-padded convolution as a matrix multiplication enables a significant reduction in operations.
Solving for both sets of equations, we find that the linear combination follows a locally connected pattern similar to a convolution with transposed kernels, as shown below.
$$w^D_{1,2} = w_{1,0} + w_{1,1} + w_{2,0} + w_{2,1}$$
To account for the redundant replication of input pixels, a functionally equivalent deconvolution uses a stride S = r and maintains padding such that $P_D = P = \frac{K-1}{2}$. The resulting kernel size $K_D$ is calculated below where K is the convolution kernel size given by Eq. 1, $I_H$ is the input height, S is the stride, and $O_H$ is the output height [12].
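The kernel size can be worked out as follows. This sketch assumes the standard deconvolution output-size relation together with the values stated above ($O_H = rI_H$, S = r, $P_D = (K-1)/2$).

```latex
\[
O_H = S\,(I_H - 1) + K_D - 2P_D
\]
% Substitute O_H = r I_H, S = r, P_D = (K - 1)/2
\[
r I_H = r\,(I_H - 1) + K_D - (K - 1)
\;\Longrightarrow\;
K_D = r + K - 1.
\]
```

With r = 2 and K = 3 this gives $K_D = 4$, matching the 4 × 4 deconvolution kernel of Figure 5.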
Building from this algebraic reduction, we introduce the weight convolution, given by Algorithm 8. Given a NN resize convolution with a valid kernel size, this algorithm transforms K × K learned convolution kernels into (r + K − 1) × (r + K − 1) deconvolution kernels to be executed as a functionally equivalent deconvolution. To stride over the kernel space, the inequality $i + K - 1 < K_D$ must hold such that i < r. As shown in Figure 9, the weight convolution transposes the learned kernels before convolving over the weight space with a stride of 1. Similar to the weight shuffle, this algorithm is a one-time sunk cost that is done before deploying a trained model for inference. However, unlike the weight shuffle, the weight convolution drastically lowers the total compute work as a consequence of the algebraic reductions. The resulting deconvolution requires only $H \times W \times C^2 \times (r + K - 1)^2$ multiply-accumulate (MAC) operations. When compared to the resize convolution, which requires $H \times W \times C^2 \times K^2 \times r^2$, this is a significant reduction that scales with upsampling factor r. Using the standard kernel size of 3, the resulting ratio of deconvolution MACs to NN resize convolution MACs becomes $\frac{(r+2)^2}{9r^2}$. When upsampling by a factor of 2, the resulting deconvolution only requires 44% of the MAC operations used by its NN resize convolution counterpart to generate the same image. When upsampling by a factor of 3, the resulting deconvolution only requires roughly 31% of the MACs to generate the same image.
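The weight convolution and the MAC ratio can be checked with a small single-channel sketch. This is an illustrative reading of the description above rather than Algorithm 8 verbatim: it assumes convolution means cross-correlation, interprets the kernel reversal as a 180-degree rotation (as in the Figure 9 caption), and uses hypothetical function names throughout.

```python
def nn_upsample(x, r):
    """Nearest neighbor interpolation by pixel replication."""
    return [[x[i // r][j // r] for j in range(len(x[0]) * r)]
            for i in range(len(x) * r)]


def conv2d_same(x, w):
    """Same-padded, stride-1 convolution (cross-correlation) for one channel."""
    h, wid, k = len(x), len(x[0]), len(w)
    p = (k - 1) // 2
    out = [[0] * wid for _ in range(h)]
    for i in range(h):
        for j in range(wid):
            for kh in range(k):
                for kw in range(k):
                    ii, jj = i + kh - p, j + kw - p
                    if 0 <= ii < h and 0 <= jj < wid:
                        out[i][j] += x[ii][jj] * w[kh][kw]
    return out


def deconv2d(x, w, stride, pad):
    """Standard deconvolution: scatter-accumulate over the input space."""
    ih, iw, k = len(x), len(x[0]), len(w)
    oh = stride * (ih - 1) + k - 2 * pad
    ow = stride * (iw - 1) + k - 2 * pad
    out = [[0] * ow for _ in range(oh)]
    for i in range(ih):
        for j in range(iw):
            for kh in range(k):
                for kw in range(k):
                    o_r, o_c = stride * i + kh - pad, stride * j + kw - pad
                    if 0 <= o_r < oh and 0 <= o_c < ow:
                        out[o_r][o_c] += x[i][j] * w[kh][kw]
    return out


def weight_convolution(w, r):
    """Build the (r + K - 1)-sized deconvolution kernel from a K x K kernel."""
    k = len(w)
    flipped = [row[::-1] for row in w[::-1]]      # reverse both index axes
    wd = [[0] * (r + k - 1) for _ in range(r + k - 1)]
    for i in range(r):                            # i + K - 1 < K_D, so i < r
        for j in range(r):
            for kh in range(k):
                for kw in range(k):
                    wd[i + kh][j + kw] += flipped[kh][kw]
    return wd


def mac_ratio(k, r):
    """Deconvolution MACs divided by NN resize convolution MACs."""
    return (r + k - 1) ** 2 / (k * k * r * r)


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for r in (2, 3):
    same = conv2d_same(nn_upsample(x, r), w) == deconv2d(x, weight_convolution(w, r), r, 1)
    print(r, same, round(mac_ratio(3, r), 3))
```

In this sketch the entry at row 1, column 2 of the r = 2 kernel equals w[1][0] + w[1][1] + w[2][0] + w[2][1], in line with the expression for $w^D_{1,2}$ given earlier in this section.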

Time and Energy Analysis for Local Inference
In our proposed edge computing paradigm depicted in Figure 1, a deep learning model is first trained in the cloud using either sub-pixel convolution (C-SP) or nearest neighbor resize convolution (C-NN). The learned weights are then recast by kernel transformations (Section 4) to be deployed for inference as a functionally equivalent deconvolution: D-SP or D-NN, respectively. In Section 2.3, we discuss the various formulations of deconvolution that avoid the overlapping summation problem observed when the kernel size K is greater than the stride S. We consider the following deconvolution variants as low-level inference algorithms: improved reverse looping deconvolution (REVD2), transforming deconvolution to convolution (TDC), and fractionally strided deconvolution (STRD).

Figure 9: Weight convolution. Panel (a): 2D Transpose. In this example, we can equate the operations depicted in Figures 4 and 5 by first rotating the kernels to reverse the indices, then convolving the reversed 3 × 3 convolution kernel (in green) over the 4 × 4 deconvolution kernel (in yellow) with a stride of 1.

Algorithm 8 Weight Convolution
    for i = 0, i++, while i < r do
        for j = 0, j++, while j < r do
Here, we present a framework for quantitative analyses to validate our proposed paradigm based on the properties of each algorithm. Results and conclusions are discussed in Section 6.

Quantitative Models for Time and Energy Costs
We use the simplified optimistic quantitative model detailed by [6] to compare the time and energy costs of each algorithm characterized by their compute and memory requirements. Given that $T_{comp}$ and $T_{mem}$ are the total time to execute all compute and memory operations, respectively, the time cost model given by Equation 14 assumes an idealized hardware design that perfectly masks the communication overhead with computation work [6]. Here, a higher computation-to-communication ratio would better hide memory bandwidth bottlenecks.
$$T = \max(T_{comp}, T_{mem}) \qquad (14)$$
Similarly, let $E_{comp}$ and $E_{mem}$ be the total energy to execute all compute and memory operations, respectively, and let $E_0(T)$ be the cost of constant energy expended while executing the algorithm. Unlike time cost, the energy cost model given by Equation 15 does not overlap computation and communication costs and has an additional penalty for increased latency [6].
$$E = E_{comp} + E_{mem} + E_0(T) \qquad (15)$$
For a fixed hardware architecture, let $\tau_{comp}$ and $\tau_{mem}$ be the time cost per compute and memory operation, respectively. For a given algorithm, let C be the total number of computation operations and M the total number of memory operations required. Under the optimistic assumption of hiding memory latency with perfect overlap, the total time cost of an algorithm becomes Equation 16, where $T_{comp} \equiv C\tau_{comp}$ and $T_{mem} \equiv M\tau_{mem}$ [6].
$$T = \max(C\tau_{comp}, M\tau_{mem}) \qquad (16)$$
Similarly, let $\epsilon_{comp}$ and $\epsilon_{mem}$ be the energy cost per compute and memory operation, respectively. The total energy cost of an algorithm then becomes Equation 17, where $E_{comp} \equiv C\epsilon_{comp}$, $E_{mem} \equiv M\epsilon_{mem}$, and the constant energy cost is assumed to be linear in time with a fixed constant power defined by $\pi_0$ such that $E_0(T) \equiv \pi_0 T$ [6].
$$E = C\epsilon_{comp} + M\epsilon_{mem} + \pi_0 T \qquad (17)$$
Given the compute and memory requirements of an algorithm, we use Equations 16 and 17 to estimate time and energy costs using this idealized abstraction of hardware performance.
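The cost model described above can be expressed as two small functions. This is a minimal sketch that follows the definitions in the text (perfect overlap for time, no overlap plus a constant-power term for energy); the function names and the hardware constants in the example are hypothetical placeholders and do not correspond to any measured device.

```python
def time_cost(c_ops, m_ops, tau_comp, tau_mem):
    """Total time when memory traffic is perfectly overlapped with compute."""
    return max(c_ops * tau_comp, m_ops * tau_mem)


def energy_cost(c_ops, m_ops, eps_comp, eps_mem, pi0, t):
    """Total energy: compute + memory + constant power pi0 drawn for time t."""
    return c_ops * eps_comp + m_ops * eps_mem + pi0 * t


# Hypothetical per-operation costs, for illustration only.
tau_comp, tau_mem = 1.0, 5.0      # time units per operation
eps_comp, eps_mem = 2.0, 10.0     # energy units per operation
pi0 = 0.5                         # constant power

for c_ops, m_ops in [(100, 10), (100, 40)]:
    t = time_cost(c_ops, m_ops, tau_comp, tau_mem)
    e = energy_cost(c_ops, m_ops, eps_comp, eps_mem, pi0, t)
    print(c_ops, m_ops, t, e)
```

In the second case the workload is memory-bound, so the extra memory operations raise both the time and, through the constant-power term, the energy.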

Deconvolution Time and Energy Costs
Unlike REVD2, both STRD and TDC require the insertion of zeros to upsample an image by a factor of r. As shown in Figure 7, the presence of these redundant zero-valued computations increases with the upsampling factor and cannot be ignored when analyzing data movement patterns. Table 1 summarizes the compute and memory requirements for each of the low-level deconvolution algorithms. Compute requirements (C) are measured by the number of multiply-accumulates (MACs). Memory requirements (M) are measured by the number of weights (W) and activations (A), i.e. the sum of the input and output feature maps, such that M ≡ W + A. We separately consider deconvolution as translated from sub-pixel convolution (D-SP) and deconvolution as translated from nearest neighbor resize convolution (D-NN). To estimate time and energy costs, we translate these compute and memory requirements using the idealized abstraction of hardware performance given by Equations 16 and 17. In each experiment, we consider the case of upsampling a square 1K RGB image by a factor of r using 3 × 3 kernels. We assume all pixel values and network parameters are executed and stored at 32-bit precision, i.e. 4 bytes. Figure 10 shows the relative increase in time and energy costs as a function of upsampling factor r for each low-level deconvolution algorithm using either D-SP or D-NN formulations when executing on the NVIDIA GeForce GTX 680. Because memory requirements are dominated by activations rather than weights, as shown in Table 1, the impact of the zero-insertion requirements is massive for STRD and minimal for TDC. As discussed in Section 3, there are no zero-insertion requirements for REVD2.

Table 1: Deconvolution Compute and Memory Requirements. For simplicity, we assume square kernels K, square inputs H, and equal input/output channels C. We define $P_H = (H - 1)(r - 1)$ as the zeros inserted to pad each input pixel for the fractionally strided deconvolution (STRD).

Convolution-based Upsampling Algorithm Time and Energy Costs
Whereas deconvolution directly upsamples an image in one operation, both the sub-pixel convolution (C-SP) and nearest neighbor resize convolution (C-NN) rely on memory-dominated operations to move data for post- and pre-processing, respectively. These operations are required for every inference pass and cannot be ignored when analyzing data movement patterns. Table 2 summarizes the compute and memory requirements for each of the high-level convolution-based image upsampling algorithms. Note that the compute (C) and activation (A) requirements of C-SP and C-NN are equal. The sub-pixel convolution performs computations in low resolution (LR) space but generates r² times as many output channels and requires expensive post-processing in high resolution (HR) space. The resize convolution requires less-expensive pre-processing in LR space, but performs its computations in HR space. Using kernel transformations to enable deconvolution for inference at the edge avoids any penalties for expensive data pre- or post-processing while maintaining the image fidelity learned through training in the cloud. We separately consider deconvolution as translated from sub-pixel convolution (D-SP) and deconvolution as translated from nearest neighbor resize convolution (D-NN). For each experiment, we assume REVD2 as the low-level algorithm and consider the case of upsampling a square 1K RGB image by a factor of r using standard kernels of size 3 × 3. We assume all pixel values and network parameters are executed and stored at 32-bit precision. Figure 11 shows the relative increase in time and energy costs as a function of upscaling factor r when executing each convolution-based upsampling algorithm on the NVIDIA GeForce GTX 680. With activations dominating memory requirements, any deviation in weight requirements (W) has minimal impact on time and energy costs.

Figure 10: Using the optimistic quantitative model proposed in [6], we analyze the relative increase in time and energy costs as a function of upscaling factor r. Each experiment assumes a square 1K RGB input image upsampled using a standard 3 × 3 kernel on NVIDIA's GeForce GTX 680. We normalize all values by REVD2 time and energy costs without upsampling (r = 1). As shown in Table 1, the compute (C) and activation (A) requirements of TDC and REVD2 are equal. With data movement dominated by activations (A), any variance in weights (W) is minimally impactful.
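The statement that C-SP and C-NN have equal compute requirements can be checked with a simple count. The sketch below is illustrative only: it assumes stride-1 convolutions with "same" padding, square K × K kernels, and equal input/output channels C, and it does not reproduce the entries of Table 2. The function names are hypothetical.

```python
def macs_subpixel_conv(h: int, k: int, c: int, r: int) -> int:
    """MACs for sub-pixel convolution (C-SP).

    The convolution runs in LR space (H x H) with C input channels and
    r^2 * C output channels. The pixel shuffle performs no MACs; it only
    moves data.
    """
    return h * h * k * k * c * (r * r * c)


def macs_resize_conv(h: int, k: int, c: int, r: int) -> int:
    """MACs for nearest neighbor resize convolution (C-NN).

    The input is first resized to HR space (rH x rH). The convolution then
    maps C input channels to C output channels at that resolution.
    """
    hr = r * h
    return hr * hr * k * k * c * c


if __name__ == "__main__":
    H, K, C = 1024, 3, 3  # assumed 1K RGB image with 3 x 3 kernels
    for r in (2, 3, 4):
        sp = macs_subpixel_conv(H, K, C, r)
        nn = macs_resize_conv(H, K, C, r)
        print(f"r={r}: C-SP MACs={sp}, C-NN MACs={nn}, equal={sp == nn}")
```

Both counts reduce to r²H²K²C², so under these assumptions the difference between the two algorithms lies in where the data movement occurs, not in the amount of arithmetic.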

Estimating Energy Efficiency using Data Reuse Patterns
The efficiency of an algorithm is typically described by data reuse and measured using arithmetic intensity, the number of useful compute operations for every byte of data accessed. Algorithms with high data reuse are more likely to yield performance improvements with an increase in compute resources because computations dominate the communication overhead. Algorithms with low data reuse put more strain on a system's memory bandwidth as each compute operation requires more off-chip memory accesses. While the value of this ratio implies the scalability and locality of an algorithm, it fails to properly estimate the energy efficiency of convolution-based deep learning algorithms when the ratio of activations (A) to weights (W) drastically deviates from 1 [15]. By separately considering weight and activation reuse, Jha et al. [15] show that, for convolution-based deep learning algorithms, the variation in arithmetic intensity is attributed to the variation in activation reuse and is highly correlated to variations in energy efficiency¹¹. Following this work, we use the compute and memory requirements discussed in Section 5.1 to estimate the energy efficiencies of convolution-based upsampling algorithms using activation reuse. When estimating the energy efficiency using activation reuse, we define useful compute operations as those contributing to output pixel values, i.e. non-zero-valued computations. However, energy consumption is dominated by data movement rather than computation [14]. As such, we define activation reuse as the number of non-zero-valued computations for every byte of activation data accessed when upsampling an image by a factor of r. Figure 12 shows how activation reuse increases with upsampling factor r for each convolution-based upsampling algorithm. For D-SP and D-NN, we assume REVD2 as the low-level deconvolution algorithm. Again, we consider the case of upsampling a square 1K RGB image by a factor of r using standard kernels of size 3 × 3 where all pixel values and network parameters are executed and stored at 32-bit precision.

Table 2: Convolution-based Upsampling Algorithm Compute and Memory Requirements. For simplicity, we assume square kernels K, square inputs H, and equal input/output channels C. Note that both C-SP and C-NN activations include the pixel shuffle and NN-interpolation, respectively, as they are required for every inference pass. We assume REVD2 as the low-level deconvolution algorithm.

Figure 11: Using the optimistic quantitative model proposed in [6], we analyze the relative increase in time and energy costs as a function of upsampling factor r. Each experiment assumes a square 1K RGB input image upsampled using a standard 3 × 3 kernel on NVIDIA's GeForce GTX 680. We normalize all values by D-SP time and energy costs without upsampling (r = 1). As shown in Table 2, the compute (C) and activation (A) requirements of C-SP and C-NN are equal. With data movement dominated by activations (A), any variance in weights (W) is minimally impactful. We assume REVD2 as the low-level deconvolution algorithm.
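The definition of activation reuse above can be expressed as a small helper. This is a minimal sketch: the function name and the example counts are hypothetical, and the MAC count in the usage example is an illustrative assumption, not a value taken from Table 1 or Table 2.

```python
def activation_reuse(nonzero_macs: int, activation_elements: int,
                     bytes_per_element: int = 4) -> float:
    """Non-zero-valued MACs per byte of activation data accessed.

    activation_elements counts input plus output feature map elements.
    bytes_per_element defaults to 4, i.e. 32-bit precision.
    """
    if activation_elements <= 0 or bytes_per_element <= 0:
        raise ValueError("activation size and element width must be positive")
    return nonzero_macs / (activation_elements * bytes_per_element)


if __name__ == "__main__":
    H, K, C, r = 1024, 3, 3, 2  # assumed 1K RGB image, 3 x 3 kernels, 2x upsampling
    in_elems = H * H * C                 # H x H x C input feature map
    out_elems = (r * H) * (r * H) * C    # rH x rH x C output feature map
    macs = r * r * H * H * K * K * C * C  # illustrative MAC count (assumption)
    print(activation_reuse(macs, in_elems + out_elems))
```

Any extra activation traffic, such as a pixel shuffle or interpolation pass, would be added to `activation_elements` and would lower the resulting ratio for the same amount of useful compute.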

Roofline Models of Time and Energy
The hardware architecture analog to arithmetic intensity is time-balance [6]. For a fixed machine, its time-balance point (B_τ) is defined as the ratio of its time cost per memory operation (τ_mem) to its time cost per compute operation (τ_comp) such that B_τ ≡ τ_mem/τ_comp [6]. Similarly, the energy-balance point (B_ε) of a machine is defined as the ratio of its energy cost per memory operation (ε_mem) to its energy cost per compute operation (ε_comp) such that B_ε ≡ ε_mem/ε_comp [6]. When the arithmetic intensity of an algorithm is equal to the balance of a machine such that AI = B, the cost of the algorithm's compute operations is equal to that of its memory operations. We can visualize these balance principles using roofline models of time [29] and energy [6], which provide an upper bound on the attainable performance of an algorithm for a fixed hardware as a function of that algorithm's data reuse patterns. Figure 13 shows these roofline models using the balance points provided by [6]. Here, we consider the case of upsampling a square 1K RGB image by a factor of 2 using 3 × 3 kernels at 32-bit precision. The roofs of each model are normalized to peak performance and data reuse is measured by arithmetic intensity for time and activation reuse for energy.

Figure 12: ... Tables 1 and 2 to calculate the activation reuse for each convolution-based image upsampling algorithm. Following the work of Jha et al. [15], we use this metric to estimate the energy efficiency of each algorithm as a function of upsampling factor r. Each experiment assumes a square 1K RGB input image upsampled using a standard 3 × 3 kernel.

Figure 13: Normalized Roofline Models of Time and Energy. Here, we visualize roofline models of time (top row) and energy (bottom row) for NVIDIA's GeForce GTX 680. For each experiment, the black curve is the roofline model and the vertical black line is its respective balance point using values provided by [6]. The roofs of each model are normalized to peak performance. As discussed in Section 5.2, we use arithmetic intensity for the time roofline model [29] and activation reuse for the energy roofline model [6,15].
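The balance principles described in this section can be sketched numerically. The code below assumes the standard normalized forms of the two models: a time roofline of min(1, I/B_τ), and an energy curve of 1/(1 + B_ε/I) with constant power ignored. These forms are assumptions of this sketch and may differ from the exact model used for Figure 13. The balance values in the usage example are placeholders, not the GTX 680 values reported in [6].

```python
def time_roofline(intensity: float, b_tau: float) -> float:
    """Normalized attainable performance: min(1, I / B_tau)."""
    return min(1.0, intensity / b_tau)


def energy_roofline(intensity: float, b_eps: float) -> float:
    """Normalized energy efficiency: 1 / (1 + B_eps / I).

    Assumes energy = compute energy + memory energy, with no constant power.
    """
    return 1.0 / (1.0 + b_eps / intensity)


def is_compute_bound(intensity: float, balance: float) -> bool:
    """An algorithm at or above the balance point is compute-bound."""
    return intensity >= balance


if __name__ == "__main__":
    B_TAU, B_EPS = 8.0, 4.0  # hypothetical balance points (MACs per byte)
    for intensity in (1.0, 4.0, 8.0, 32.0):
        print(f"I={intensity}: "
              f"time={time_roofline(intensity, B_TAU):.2f}, "
              f"energy={energy_roofline(intensity, B_EPS):.2f}, "
              f"compute-bound in time={is_compute_bound(intensity, B_TAU)}")
```

At an intensity equal to the balance point, the time model reaches its peak while the energy model sits at half of its peak, which reflects compute and memory costs being equal at that point.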

Results of Quantitative Analysis
Local edge devices are often limited by battery life and hardware area, which constrain the power budget and on-chip resources available for inference [9,26]. To support real-time deep learning applications, low latency and high energy efficiency become critical. Using the quantitative models detailed in Section 5, we analyze and compare the properties of convolution-based upsampling algorithms using metrics of time and energy. Real-time image upsampling applications, such as single-image super resolution, typically process images sequentially in batch sizes of 1. When considering such applications, we define latency as the time cost incurred by upsampling a single image. We evaluate energy efficiency under two metrics. First, we use energy per pixel to evaluate the energy cost incurred by upsampling a single image [15]. This is defined as the total energy cost divided by the total output pixels generated and is measured in units of Joules/pixel [15]. Second, we use performance per energy to evaluate the rate of computation for every unit of energy consumed. This is defined as the total useful computations divided by the total energy cost and, as the energy analog of throughput, is measured in units of MACs/Joule [6]. Using these metrics, we validate the use of kernel transformations in our proposed edge computing paradigm assuming REVD2 as the low-level deconvolution algorithm.
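The two energy metrics defined above are simple ratios and can be written directly. The following sketch uses hypothetical function names and made-up input values purely for illustration; none of the numbers are results from the experiments.

```python
def energy_per_pixel(total_energy_joules: float, output_pixels: int) -> float:
    """Total energy cost divided by output pixels generated (Joules/pixel)."""
    if output_pixels <= 0:
        raise ValueError("output_pixels must be positive")
    return total_energy_joules / output_pixels


def performance_per_energy(useful_macs: int, total_energy_joules: float) -> float:
    """Total useful computations divided by total energy cost (MACs/Joule)."""
    if total_energy_joules <= 0:
        raise ValueError("total_energy_joules must be positive")
    return useful_macs / total_energy_joules


if __name__ == "__main__":
    H, r = 1024, 2                   # assumed 1K input upsampled by 2
    pixels = (r * H) * (r * H)       # rH x rH output pixels
    energy = 0.5                     # made-up energy cost in Joules
    macs = 300_000_000               # made-up useful MAC count
    print(energy_per_pixel(energy, pixels))
    print(performance_per_energy(macs, energy))
```

Because both metrics share the total energy cost, an algorithm can improve energy per pixel while lowering performance per energy if its useful compute falls faster than its energy cost.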
As shown in Table 2, the compute (C) and activation requirements (A) of the sub-pixel convolution (C-SP) and nearest neighbor resize convolution (C-NN) are equal¹². With activations dominating memory requirements (M) and, therefore, data transfer penalties, the variations in weight requirements (W) have negligible impact on time and energy costs. By translating their learned kernels for deconvolution inference, the kernel transformations introduced in Section 4 remove the reliance of C-SP and C-NN on the memory-intensive feature map transformations that increase activation requirements. We highlight the following implications of this reduction in memory accesses, assuming REVD2 as the low-level deconvolution inference algorithm.
1. Alleviating pressure on memory bandwidth significantly reduces system latency. The detrimental impact of memory-intensive feature map transformations on C-SP and C-NN increases with upsampling factor. Translating C-SP to D-SP removes its reliance on the pixel shuffle post-processing and, as discussed in Section 4, translating C-NN to D-NN significantly reduces the total MACs required when removing its reliance on interpolation pre-processing. Figure 11a shows that the time cost of each algorithm increases with upsampling factor r using the optimistic computational model detailed in Section 5.1. In our experiments, upsampling an image by a factor of 2 using D-SP decreases system latency by 2.2x when compared to C-SP and using D-NN decreases system latency by 2.6x when compared to C-NN.

2. Reducing activation requirements and, therefore, data transfers significantly reduces the energy cost of upsampling an image. Energy consumption is dominated by data movement [14]. With activation requirements dominating memory accesses, removing the memory-intensive feature map transformations of C-SP and C-NN reduces the energy consumed when generating an output image. Figure 11b shows how the energy cost of each algorithm increases with upsampling factor r using the optimistic computational model detailed in Section 5.1. As each algorithm is generating an rH × rH × C output image, we interpret this as the relative increase in energy per pixel. In our experiments, upsampling an image by a factor of 2 using D-SP decreases the energy per pixel by 2.1x when compared to C-SP and using D-NN decreases the energy per pixel by 2.5x when compared to C-NN.
3. Reducing data transfers significantly improves the rate of computation for every unit of energy consumed. As discussed in Section 5.2, activation reuse is defined as the ratio of useful compute work to activation requirements. The activation reuse of convolution-based upsampling algorithms is tightly correlated with performance per energy (PPE) [15]. Figure 12 shows how the activation reuse of each convolution-based upsampling algorithm increases with upsampling factor r. While removing reliance on memory-intensive feature map transformations improves the PPE of D-SP, it reduces the PPE of D-NN as the reduction in MACs significantly outweighs the reduction in memory accesses. This imbalanced reduction ultimately renders D-NN memory-bound as the amount of compute work grows slower than the amount of memory accessed.

4. Reducing memory accesses improves algorithm scalability. The ratio of useful computations to memory accesses, i.e. arithmetic intensity, implies the scalability of an algorithm. As described in Section 5.3, algorithms with an arithmetic intensity lower than the machine balance point are ultimately bound by memory bandwidth [29]. Algorithms with an arithmetic intensity higher than the machine balance point are ultimately bound by compute resources and are more likely to see performance gains as resources scale [6]. For each convolution-based upsampling algorithm, we use the roofline models depicted in Figure 13 to visualize the relationships of their arithmetic intensities to the time and energy machine balance points of NVIDIA's GeForce GTX 680. We further show how these relationships change with upsampling factor in Figure 14. Unlike C-SP, C-NN, and even D-NN, D-SP remains compute-bound in both time and energy as upsampling factor increases. With compute work dominating memory accesses, the increased arithmetic intensity of D-SP implies increased time and energy efficiency as compute resources scale [6].
These trends do not hold for all selections of deconvolution formulations. Deep learning frameworks commonly use the fractionally strided deconvolution (STRD) formulation to leverage unmodified convolution accelerators [21,1]. However, the zero-insertion requirements on the input feature maps sharply increase the data transfer penalties, as the zero-inserted input grows quadratically with the upsampling factor. Figure 10 shows how time and energy costs increase with upsampling factor. When upsampling by a factor of 2, using REVD2 instead of STRD reduces latency by 1.6x and reduces energy per pixel by 1.9x in our experiments. As shown in Table 1, the compute and activation requirements of transforming deconvolution to convolution (TDC) are the same as those of REVD2. As such, the TDC zero-insertion requirements on the learned kernels have negligible impact on high resolution images. However, as discussed in Section 3, the functional correctness of TDC breaks down when attempting to tile the workloads in sizes not evenly divisible by the stride S. While this penalty does not show in our simplified quantitative model of hardware performance, we aim to quantify this impact in future work.

¹² For simplicity of analysis, we ignore the impact of address calculations and modulo arithmetic.
Figure 14: The Scalability of Convolution-based Upsampling Algorithms. The memory-intensive feature map transformations of C-SP and C-NN render each algorithm memory-bound in both time and energy on NVIDIA's GeForce GTX 680. Avoiding the pixel shuffle post-processing of C-SP drastically increases the efficiency of D-SP for inference. The increased activation reuse implies increased time and energy efficiency as compute resources scale because the overwhelming majority of work is dedicated to computations rather than memory accesses [6]. For D-NN, the significant reduction in MACs outweighs the reduced memory accesses, ultimately rendering it memory-bound in both time and energy because memory pressure grows faster than the compute workload as the upsampling factor increases. Unlike C-SP, C-NN, and even D-NN, D-SP remains compute-bound in both time and energy as upsampling factor increases.

Conclusions and Future Work
Cloud computing systems can have nearly limitless resources, making them ideal for resource-intensive tasks such as data storage, data processing, and model training. However, real-time deep learning applications often require edge computing frameworks to improve system latency by executing inference locally on edge devices without reliance on a stable internet connection [9,26]. We propose a novel edge computing paradigm for real-time convolution-based image upsampling applications that separately considers algorithms for training in the cloud and inference at the edge. The use of sub-pixel or resize convolution is confined to training in the cloud to minimize the data transfer penalties incurred by the memory-intensive feature map transformations they require for inference. The learned convolution kernels are then transformed to deconvolution kernels without sacrificing the image fidelity learned in training. These deconvolution kernels are then deployed for inference at the edge using our improved reverse looping deconvolution algorithm (REVD2). We compare REVD2 against pre-existing deconvolution variants and show it is more efficient and parallelizable. Using quantitative models of time and energy, we show that executing deconvolution inference at the edge with REVD2 improves both system latency and energy efficiency when compared to sub-pixel or resize convolution counterparts. When optimizing for energy efficiency and scalability, we show that training with sub-pixel convolution in the cloud and then transforming the learned kernels using the weight shuffle for deconvolution inference at the edge minimizes the pressure on memory bandwidth and maximizes energy efficiency.
When optimizing for latency and energy consumption, we show that training with nearest neighbor resize convolution in the cloud and then transforming the learned kernels using the weight convolution for deconvolution inference at the edge minimizes the time and energy costs incurred by upsampling an image. In future work, we aim to extend our analyses to quantify adaptability to available hardware resources and build hardware designed to exploit the parallelism exposed by REVD2. Code for each algorithm discussed in this paper can be found at https://github.com/icolbert/upsampling.