Design of a Low-Power Super-Resolution Architecture for Virtual Reality Wearable Devices

Head-mounted displays (HMDs) have made virtual reality (VR) accessible to a widespread consumer market, revolutionizing many applications. Among the limitations of current HMD technology, the need for generating high-resolution images and streaming them at adequate frame rates is one of the most critical. Super-resolution (SR) convolutional neural networks (CNNs) can be exploited to alleviate timing and bandwidth bottlenecks of video streaming by reconstructing high-resolution images locally (i.e., near the display). However, such techniques involve a significant amount of computation that often makes their deployment within area-/power-constrained wearable devices unfeasible. This research work originated from the consideration that the human eye can capture details with high acuity only within a certain region, called the fovea. Therefore, we designed a custom hardware architecture able to reconstruct high-resolution images by treating the foveal region (FR) and the peripheral region (PR) through accurate and inaccurate operations, respectively. Hardware experiments demonstrate the effectiveness of our proposal: a customized fast SR CNN (FSRCNN) accelerator realized as described here and implemented on a 28-nm process technology is able to process up to 214 ultrahigh-definition (UHD) frames/s, while consuming just 0.51 pJ/pixel without compromising the perceptual visual quality, thus achieving a 55% energy reduction and a $\times 14$ higher throughput with respect to state-of-the-art competitors.

than that supported by the display is actually transferred to the destination [6]. Then, an appropriate upscaling step is performed at the HMD side by specialized hardware modules. Second, the SRCNN approach supports the subsequent detection/classification steps by enhancing the quality and the number of detected features [7]. For these reasons, SRCNN models were successfully exploited in the context of HMD devices [8], [9], [10]. Unfortunately, a straightforward implementation of the above models moves the problem from the transmission side to the computation side, making most of them unsuitable for energy-constrained applications. Commonly adopted approaches, such as quantization and pruning, may not be efficient enough to cope with the abovementioned constraints [11]. Therefore, the realization of low-power upscaling hardware modules for HMDs is still a challenge.
With the aim of efficiently exploiting the SR technique in HMD devices, we propose a new processing strategy that makes use of the concepts of foveated rendering and approximate computing, thus allowing the computational complexity of the SR CNN elaboration to be significantly reduced. The new approach exploits the fact that the human visual system (HVS) is characterized by high visual acuity only in the central 5.2° region of the retina, named the fovea [12]. Outside this area (i.e., in the periphery), the distribution of retinal components changes rapidly, resulting in relatively lower visual acuity, color sensitivity, and stereo depth discrimination capability. The main contributions of this work are the following.
1) We introduce a new computational scheme that upscales low-resolution images by processing the foveal region (FR) and the peripheral region (PR) at different levels of accuracy.
2) The proposed approach is applied to the fast SR CNN (FSRCNN) model [13], and a thorough accuracy evaluation on image samples from the Set5, Set14, B100, and Kodak datasets is presented.
3) To demonstrate its effectiveness, a purpose-designed architecture has been implemented by using both field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) technologies. Results show significant improvements over state-of-the-art competitors in terms of both speed and energy, while keeping comparable overall (OA) reconstruction quality.

II. BACKGROUND AND MOTIVATIONS

A. Foveated Rendering Displays
Understanding the HVS and its limitations is of fundamental importance to implement efficient machine vision systems able to satisfy, at the same time, application quality and latency requirements [12]. The HVS consists of two main components: the eyes and the brain. The former act as image sensors, acquiring information that is transferred to the brain for subsequent elaboration. During the sensing process, the light photons reaching the retina are converted into electrical signals by photoreceptors. However, since the latter are not uniformly distributed across the retina, a significant compression occurs at this level, so that only a portion of the original information reaches the brain. Such a spatially varying photoreceptor density, peaking at the center of the retina, leads to a visual acuity that is maximum at the fovea and reduced toward the periphery. Therefore, the HVS does not uniformly perceive the quality of each pixel of a digital image. According to the study presented in [14], in a 100°-wide field-of-view HMD, only 4% of the screen pixels fall in the FR, whereas the rest lie within the peripheral one. This outcome highlights that foveated rendering is well suited to modern HMD devices as a means of reducing the computational workload without sacrificing perceptual visual quality.
As depicted in Fig. 1, the latest HMDs are equipped with numerous cameras [1]. One or more sensors are adopted for real-time gaze tracking, in order to discern where the user's eyes are looking. This feature can be used to enable dynamic foveated rendering according to gaze information. The eye-tracker is responsible for detecting saccades, i.e., rapid and conjugate eye movements that voluntarily shift the eyes from one target to another, thus producing a corresponding change of the FR within the observed scene. When a saccade occurs, the eye-tracker extracts the region of interest and converts it into bounding box coordinates (denoted in the following as $x_R$, $y_R$, $W_R$, and $H_R$). Since the time interval between two consecutive saccades is ≈300 ms [15] and the display refresh rate of modern HMDs is 90 Hz, a new region of interest is identified about every 27 frames. Such an interval represents the available time during which an eye position change should be reflected in the display update and corresponds to the target display latency. The latter influences the design of both the eye tracker and the foveation-based postprocessing module. Furthermore, recent studies [16] demonstrated that HMD latencies higher than 80 ms cause a significant reduction in the acceptable amount of foveation, thus making artifact defects of PRs dominant. It is thereby expected that, in the near future, significant efforts will be spent toward the hardware acceleration of the foveated rendering processes involved in these kinds of applications.
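As a quick sanity check of the latency budget above, the ≈27-frame interval follows directly from the saccade interval and refresh rate quoted in the text:

```python
# Frames available between two consecutive saccades, using the figures
# quoted above: ~300 ms between saccades [15], 90 Hz display refresh.
SACCADE_INTERVAL_MS = 300.0   # mean time between consecutive saccades
REFRESH_RATE_HZ = 90.0        # display refresh rate of modern HMDs

frame_time_ms = 1000.0 / REFRESH_RATE_HZ           # ~11.1 ms per frame
frames_per_saccade = SACCADE_INTERVAL_MS / frame_time_ms

print(f"{frames_per_saccade:.0f} frames")  # -> 27 frames
```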

B. Super-Resolution
SR imaging allows reconstructing high-resolution images from corresponding low-resolution ones. Due to this upscaling ability, it is widely used in many contexts, ranging from image quality enhancement for medical [17] or remote sensing applications [18] to the restoration of compressed images coming from web and mobile devices [19]. While earlier SR methods [2] mainly relied on pixel interpolation to reconstruct missing information, the advent of deep learning has brought about a substantial revolution, making high-resolution details predictable through properly learned spatial correlation parameters. Such a new paradigm achieves greatly improved performance compared to traditional techniques.
Recently, some SR CNN models suitable for HMDs have been demonstrated in [8], [9], and [10]. These CNNs share a common structure, which includes an initial convolutional layer, followed by residual blocks capturing high-frequency fine details, and a final upscaling stage implemented through either subpixel or transposed convolutions (TCONVs). Despite the interesting reconstruction quality achieved by these methods, their computational complexity is quite high and represents a limitation for the deployment within wearable devices like HMDs. Just as an example, the CNN model presented in [10] processes only a crop of the original low-resolution image, corresponding to the FR. Then, the output frame is reconstructed by merging images at different resolutions, leading to a reduction of the vertical resolution with respect to the target. Even adopting a foveated-based SR technique, the CNN model [10] involves more than 350 Giga multiply-and-accumulate (MAC) operations when used for a ×2 upscaling of a 1920 × 1080 image. In our view, such computational requirements could, however, be reduced by eliminating intermediate residual blocks, at the cost of some quality degradation.
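To see why models of this kind quickly reach hundreds of Giga-MACs, the cost of a single convolutional layer can be estimated with the standard MAC-count formula (a back-of-the-envelope sketch; the layer sizes in the example are illustrative, not the exact configuration of [10]):

```python
def conv_macs(h_out, w_out, c_in, c_out, k):
    """MACs of a standard convolutional layer: one k x k x c_in dot
    product per output pixel, repeated for each of the c_out channels."""
    return h_out * w_out * c_in * c_out * k * k


# Illustrative example: one 3x3, 64->64-channel residual-block layer
# operating at 1920x1080 resolution already costs ~76 GMAC, so a model
# with several residual blocks easily exceeds the 350 GMAC quoted above.
macs = conv_macs(1920, 1080, 64, 64, 3)
print(f"{macs / 1e9:.1f} GMAC")  # -> 76.4 GMAC
```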

C. Hardware-Oriented Acceleration Methods
TCONV is one of the most common techniques used in SR CNNs to perform upscaling of low-resolution images. As in similar works [20], [21], [22], [23], [24], [25], to preserve the OA quality of the output image, an upscaling factor equal to 2 is considered. Fig. 2 illustrates the conventional TCONV process. First, the pixels of the input are rearranged within the upsampled image up so that actual pixel values are interleaved by zeros (lines 3 and 4); up is then convolved with the t × t filter kernel K to generate the output image O with size 2H × 2W (lines 5-12).
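The zero-interleaving process of Fig. 2 can be sketched as follows (a NumPy reference implementation, assuming stride 2 and cross-correlation rather than a flipped-kernel convolution; the line comments map loosely onto the pseudo-code of Fig. 2):

```python
import numpy as np


def tconv_zero_insert(img, kernel):
    """Transposed convolution (x2 upscaling) via zero interleaving:
    spread the H x W input over a 2H x 2W grid with zeros in between,
    then slide the t x t kernel over the zero-padded result."""
    h, w = img.shape
    t = kernel.shape[0]
    up = np.zeros((2 * h, 2 * w))
    up[::2, ::2] = img                     # zero interleaving (lines 3-4)
    pad = t // 2
    padded = np.pad(up, pad)
    out = np.zeros_like(up)
    for i in range(2 * h):                 # t x t convolution (lines 5-12)
        for j in range(2 * w):
            out[i, j] = np.sum(padded[i:i + t, j:j + t] * kernel)
    return out
```

Note that roughly three quarters of the entries in each window are the inserted zeros, which is precisely the source of the wasted multiplications discussed next.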
TCONV layers represent the most critical component of SR CNNs. Indeed, their computational complexity is up to 6.75 times higher than that of traditional convolutional layers [20]. Furthermore, they introduce many unnecessary multiplications by zeros and require more complex strategies to access data memory, making video streaming unfeasible in some cases. Most prior hardware designs, targeting either FPGA [20], [23], [24] or ASIC [21], [22], [25] technologies, rely on transforming the TCONV into multiple subconvolutions. While these methods perform accurate TCONVs by rearranging either the filter kernels [20], [21], [24] or the incoming pixels [23], they are not effective in reducing the number of involved MAC operations and introduce latency overheads and/or additional on-chip memory requirements. In contrast, the ASIC design presented in [22] is based on slicing the low-resolution input image into multiple tiles that are processed by a decision network to establish the number of layers each tile has to pass through, according to its content. Even though such an approach can be useful in saving MACs, the reduction rate is not deterministic, and, in the worst case, all the tiles have to be processed by all the layers of the CNN.
In this article, we aim to fill the gap between HMD device requirements and current SR CNN hardware implementations by smartly combining the concepts of foveated rendering and approximate computing [26]. In order to reduce the computational complexity of the TCONV layers used in SR CNNs, we propose a hybrid computational scheme (named HTCONV in the following) that performs mixed accurate and inaccurate elaborations on the pixels according to their position with respect to the fovea. In contrast to [10], the new method elaborates all the pixels within the original low-resolution image, thus leading to an output frame complying with the target resolution.

III. PROPOSED HTCONV

A. Computational Scheme
The computational scheme proposed here relies on skipping convolutions for a certain group of output pixels, which are instead calculated by properly interpolating neighboring pixels. Fig. 3 shows the complete pseudo-code of the proposed HTCONV method. It can be seen that the output pixel O(2i, 2j), corresponding to a nonzero value in the upsampled image up, is always computed through accurate convolution, regardless of whether it belongs to the FR (line 8) or the PR (line 12). Conversely, O(2i + 1, 2j), O(2i, 2j + 1), and O(2i + 1, 2j + 1) are calculated by performing accurate (line 8) or approximate operations (lines 13-15) according to their position inside or outside the FR, with the approximation being the average of the pixels belonging to the 2 × 2 nearest neighborhood. This strategy allows reducing the number of multiplications and additions by $4 \times t^2$ and $t^2 - 1$ times, respectively, for each pixel position. Since more than ≈95% of the pixels lie within the low-acuity regions of the HVS [14], this approach guarantees an appreciable reduction of the OA computational load while confining the detrimental impact on reconstruction quality.
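The peripheral branch of this scheme can be sketched as follows. The snippet assumes the even-position outputs O(2i, 2j) have already been computed by accurate convolution, and it reads the approximation as one shared 2 × 2 average per output group; the actual Avg2 × 2 adder tree may combine the neighbors differently, so this is an interpretation rather than a bit-exact model:

```python
import numpy as np


def htconv_periphery(acc_even):
    """Fill a PR output tile of the HTCONV scheme. acc_even holds the
    accurately convolved pixels O(2i, 2j) with shape (H, W); the three
    remaining pixels of every 2x2 output group are approximated by the
    mean of the 2x2 neighborhood of accurate pixels. Returns (2H, 2W)."""
    h, w = acc_even.shape
    out = np.zeros((2 * h, 2 * w))
    out[::2, ::2] = acc_even
    # Replicate-pad so border groups still see a full 2x2 neighborhood.
    ext = np.pad(acc_even, ((0, 1), (0, 1)), mode="edge")
    avg = (ext[:-1, :-1] + ext[:-1, 1:] + ext[1:, :-1] + ext[1:, 1:]) / 4
    out[1::2, ::2] = avg      # O(2i+1, 2j)
    out[::2, 1::2] = avg      # O(2i,   2j+1)
    out[1::2, 1::2] = avg     # O(2i+1, 2j+1)
    return out
```

Only one accurate convolution per 2 × 2 output group survives, which is where the multiplication and addition savings quoted above come from.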
To better clarify the operation of the HTCONV scheme in the PR, Fig. 4 shows an example: the output pixel O(38, 14) is computed through accurate convolution, whereas its neighbors (39, 14), (38, 15), and (39, 15) are reconstructed by averaging the 2 × 2 nearest neighborhood. It is worth noting that, due to the presence of zeroed rows/columns within the image up, the actual number of required multiplications is $t_L \times t_L$, with $t_L = \lceil (t + 1)/2 \rceil$.

B. Hardware Architecture

Fig. 5 illustrates the hardware architecture designed to implement the proposed HTCONV approach. Input pixels I(i, j) and filter values K(u, v) are transferred to proper buffering structures that form the pixel and kernel windows, respectively. To prevent unnecessary registers and latency in the input pixel buffer, an explicit upsampling stage is avoided. Therefore, pixels are arranged in proper $t_L \times t_L$ windows, and the filter values are stored within a t × t register array. The input SFov carries the position of the bounding box containing the current region of interest, as extracted by the eye tracker. This information allows discerning whether incoming pixels belong to the FR or the PR, enabling the Control circuit to generate the selector and enable signals that correctly manage the OA operation. In particular, when SFov flags that I(i, j) is a peripheral pixel, the SelSubK signal selects the image and kernel windows having the generic position (2i, 2j) as an anchor point. The extracted windows are inputted to the MAC Array module, which computes the sum of $t_L \times t_L$ products, thus producing the generic O(2i, 2j) result. During this operation mode, the Freeze signal configures the pixel buffer to receive a new input I(i, j) at each clock cycle, while the en signal enables the Avg2 × 2 block. The latter includes a line buffer that stores pixels coming from the MAC Array. At each clock cycle, the registers A-D form a new 2 × 2 neighborhood that is processed by the subsequent adders to calculate the approximate pixels.

Finally, according to the SelOut signal generated by the Control circuit, the output multiplexers select the signals X, Y, and Z generated by the Avg2 × 2 block as the O(2i + 1, 2j), O(2i, 2j + 1), and O(2i + 1, 2j + 1) output pixels, respectively. After the initial latency, the proposed circuit operating on PRs exhibits an output throughput of 4 pixels/clock cycle.

On the contrary, when the SFov signal flags foveal input pixels, the Avg2 × 2 block is disabled. In such a case, a new input I(i, j) feeds the pixel buffer, controlled by the Freeze signal, every four clock cycles. This allows computing the output pixels through accurate convolutions performed by the MAC Array. The latter receives, at each clock cycle, a different pair of pixel/kernel windows according to the SelSubK selector, thus generating the expected four results with a throughput of 1 pixel/clock cycle.
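The multiplication saving granted by the zeroed rows/columns is easy to quantify with the $t_L$ formula above: for the 9 × 9 kernels used later in this work, each accurate output pixel needs only 25 products instead of 81.

```python
import math


def tconv_mults_per_accurate_pixel(t):
    """Because every other row/column of the upsampled image up is zero,
    an accurate output pixel only needs t_L x t_L multiplications,
    with t_L = ceil((t + 1) / 2), instead of the full t x t."""
    t_l = math.ceil((t + 1) / 2)
    return t_l * t_l


print(tconv_mults_per_accurate_pixel(9))  # -> 25, vs 81 for a dense 9x9 window
```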

IV. EXPERIMENTAL RESULTS
The proposed method has been validated by integrating it within a state-of-the-art SR task based on the FSRCNN model [13]. Table I summarizes the generic FSRCNN(d, s, m) model, consisting of a feature extractor, several cascaded convolutional layers, and an upscaling stage realized by a transposed convolutional layer. The number of convolutional layers, as well as the size and the number of inputs (N_in) and outputs (N_out) of each layer, may change according to the adopted configuration, thus leading to different computational complexity-accuracy scenarios. The experiments presented here have been conducted on the pretrained FSRCNN(25, 5, 1) model, trained with the mean squared error as loss function, quantized at 16-bit fixed point, and customized by replacing the transposed convolutional layer with the proposed HTCONV module. The graph in Fig. 6 illustrates how different approximation strategies influence the complexity (x-axis), the peak signal-to-noise ratio (PSNR) (y-axis), and the memory requirement (bubble size) of the model, with respect to the baseline FSRCNN(56, 12, 4) floating-point configuration. It is worth noting that our proposal allows saving more than 80% of computations over the counterparts, with a PSNR reduction lower than 10%. As shown in the following, such a degradation has a limited visual impact since it occurs in the PRs.
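For reference, the parameter counts behind the memory figures can be reproduced from the FSRCNN layer shapes (a sketch following the original model description in [13]: a 5 × 5 feature-extraction layer, a 1 × 1 shrinking layer, m 3 × 3 mapping layers, a 1 × 1 expanding layer, and a 9 × 9 transposed-convolution upscaler; biases ignored):

```python
def fsrcnn_params(d, s, m):
    """Approximate weight count of FSRCNN(d, s, m), following the layer
    shapes of the original model [13] (biases ignored)."""
    return (5 * 5 * 1 * d          # feature extraction: 5x5, 1 -> d
            + 1 * 1 * d * s        # shrinking: 1x1, d -> s
            + m * 3 * 3 * s * s    # mapping: m layers of 3x3, s -> s
            + 1 * 1 * s * d        # expanding: 1x1, s -> d
            + 9 * 9 * d * 1)       # transposed convolution: 9x9, d -> 1


# The baseline configuration reproduces the 12,464 weights reported
# in [13]; the smaller configuration used here is far more compact.
print(fsrcnn_params(56, 12, 4))  # -> 12464
print(fsrcnn_params(25, 5, 1))   # -> 3125
```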

A. Image Quality Evaluation
The open-source Set5, Set14, B100, and Kodak datasets [27] were used to assess the reconstruction quality achieved by the proposed modified FSRCNN. According to the evaluation methodology adopted by prior works [20], [21], [22], [23], [24], benchmark images were first downsampled through the bicubic interpolator; the resulting low-resolution images were then inputted to the FSRCNN model for inference. For a preliminary analysis, we suppose here to adopt a square FR, centered in the middle of the benchmark images, and having an area equal to 4% of the total pixels [14]. Table II collects the PSNR and structural similarity (SSIM) performances obtained by the novel approach, distinguishing between FR, PR, and OA image areas. It also provides an overview of the reconstruction qualities, computational complexity (i.e., the number of MAC operations), and the amount of memory needed to store the model parameters (Params) required by state-of-the-art competitors [10], [20], [21], [22], [23], [24]. Considering that [20] implements the model FSRCNN(25, 5, 1) and [21], [23], [24] exploit the configuration FSRCNN(56, 12, 4), it can be observed that the proposed approach allows reducing computational and memory requirements by at least 5.11 and 1.09 times, respectively. In the following discussion, the PSNR/SSIM comparison is made considering the average of the percentage deviations with respect to the available data.
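The PSNR figures in Table II follow the standard definition, which can be computed as below (a minimal sketch for 8-bit images; the paper may evaluate on the luminance channel only, as is common practice in SR benchmarks):

```python
import numpy as np


def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a reference image and
    its reconstruction: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```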
The quality results extracted for the FR show the ability of the adopted model to obtain PSNR and SSIM values similar to or even better than the nonfoveated SR CNNs [20], [21], [22], [23], [24]. It is important to highlight that, when moving from the FR to the PR, the achieved PSNR and SSIM degrade, on average, by only ≈9.3% and ≈4%, respectively. Such a degradation causes a reduction of only ≈4.8% in terms of SSIM over the best counterpart [24]. This drop, being confined within regions having low acuity in the HVS, has a limited impact on the perceptual visual quality and is a reasonable price to pay in view of the ≈94% reduction of the computational complexity. Furthermore, it is worth noting that the obtained OA image qualities overcome those achieved by the foveated-based SR CNN [5]. In particular, the proposed approach allows for improving the PSNR by ≈2.8% on average, while reducing the computational complexity by more than 950 times. Fig. 7 reports some samples of original (top) and reconstructed (bottom) images. It can be noted that the FR, highlighted by a bounding box in the reconstructed images, is always processed while keeping sharp details. This is clearly appreciable, for example, in the benchmark of Fig. 7(c), where the woman's eyes have different levels of quality: the one belonging to the FR allows distinguishing light reflections and lashes, whereas the one falling within the periphery is characterized by slightly blurred contours. Finally, it must be pointed out that, in most cases, texture details belonging to the periphery are accurately reconstructed through the proposed approximate method. Just as an example, looking at the output

B. Hardware Evaluation
The hardware architecture implementing the proposed modified FSRCNN adopts a layer-folded structure; therefore, each layer is accelerated through a custom module performing the specific task. Fig. 8 illustrates the design scheme, consisting of five pipelined circuits. The circuits named Conv1, . . . , Conv4, responsible for the convolutional layers, include a line buffer architecture, sized according to the specific kernel size k, and N_out processing elements (PEs) that compute multiply-and-accumulate operations in parallel over N_in × k × k input pixels. To sustain the highest possible parallelism, the module UP-STAGE finally implements the upscaling task by using 25 HTCONV instances realized as illustrated in Fig. 5 and tailored to operate on 9 × 9 kernels. The results obtained in this way are accumulated through the ACC module, which produces the final high-resolution image.
The first prototype of this architecture has been implemented on the Xilinx XC7K410T FPGA device. Table III collects the hardware results, including speed performances (reported as maximum running frequency and output throughput), power consumption, and the number of occupied look-up tables (LUTs), flip-flops (FFs), digital signal processor (DSP) slices, and on-chip block RAMs (BRAMs). The proposed accelerator reaches an output throughput of ≈753 megapixels/s, corresponding to the generation of UHD images at a frame rate higher than 95 frames/s, which perfectly meets the target latency requirements [16].
When compared to prior works [20] and [23], which accelerate the final TCONV layer through innovative filter and pixel decomposition schemes that do not reduce the computational complexity, the proposed architecture sustains an output throughput 1.52 and 15.66 times higher, respectively, even when processing larger input frames. At the same time, in comparison with the architecture proposed in [24], the amount of LUTs, FFs, and BRAMs is reduced by 73.8%, 34.9%, and 51.5%, respectively.
Finally, the proposed accelerator exhibits an energy efficiency 2.2 times higher than the best competitor [20], demonstrating promising characteristics from the perspective of integration within low-power wearable devices like HMDs. To this purpose, the novel design was also synthesized using the STMicroelectronics 28-nm ultrathin body and buried oxide (UTBB) fully depleted silicon-on-insulator (FDSOI) 1-V process technology and the Cadence Genus tool (version 19.11). As shown in Table IV, in such a case, the proposed architecture occupies an area of just 2.6 mm² and exhibits an output throughput of 1696 megapixels/s at its maximum running frequency, which corresponds to 214 UHD frames/s. Hardware characteristics of the ASIC-based competitors [21], [22], and [25] are also reported in Table IV. Since they targeted different technologies, we scaled the original speed, area, and energy performances according to the rules provided in [28].
The proposed architecture outperforms the systems demonstrated in [21], [22], and [25] in terms of both output throughput and area requirements. More specifically, the new design, while working on larger inputs than [21], reduces the number of gates by 52.8% and increases the output throughput by ≈10 times. At a parity of process technology, the proposed method also exhibits a ×14 speed-up and a 55% (46.3%) energy (area) reduction over the tile-selective architecture demonstrated in [22]. Furthermore, in comparison with the SRCNN accelerator presented in [25], the proposed design achieves an output throughput ≈24 times higher and an area saving of ≈44%. These significant advantages are obtained at the expense of a selective degradation in the low-acuity area, while the reconstruction quality of the most perceived FR is improved.

V. CONCLUSION
In this article, a novel hardware architecture has been presented to reduce the complexity of SR CNN accelerators, thus enabling their integration within the future generation of low-power HMD systems. The proposed system relies on the characteristic of the human eye of capturing fine details only within the small FR. According to this property, the new design performs accurate MAC operations just on pixels that belong to such a region, whereas it reconstructs the remaining pixels through a simple approximation. When realized using a 28-nm process technology and integrated within the state-of-the-art FSRCNN model, the proposed architecture achieves 214 UHD frames/s while dissipating only 0.51 pJ/pixel and keeping the perceptual visual quality very close to the state-of-the-art.