A Cost-Effective Interpolation for Multi-Magnification Super-Resolution

Super-Resolution (SR) is an important research topic, and SR methods based on Convolutional Neural Networks (CNNs) have demonstrated groundbreaking performance. However, implementing a CNN model on resource-limited hardware devices remains a great challenge. Therefore, we present a hardware-friendly and low-cost interpolation for multi-magnification SR image reconstruction. We follow our previous work, a learning-based cross-directional interpolation (LCDI) with a self-defined classifier of image texture, and extend its original ×2 architecture to ×3 and ×4 architectures. In addition, the required pre-trained weights are reduced by a fusion scheme. Experimentally, the proposed method requires 75% fewer pre-trained weights than LCDI. Compared to the related work OLM-SI (one linear learning mapping SI), the run-time and the quantity of pre-trained weights of the proposed ×2 method are at least 90% lower. Compared to CNN-based SR methods, the proposed method loses only a little performance, while its estimated computational cost is much lower. In conclusion, the proposed method is a cost-effective and practical solution for resource-limited hardware devices.

as well, confirming that SI costs very few hardware resources to achieve competitive performance. Based on [22], Bae et al. [23] developed a further simplified classifier that achieves performance similar to [22]. Since local binary pattern classification is used in [23], we call it LBPC-SI in this paper.

Likewise inspired by SI, we observed that it could still be improved, and we proposed a modified version of SI in our previous work [24], called Learning-based Cross-directional Interpolation (LCDI), which applies 1-D interpolation. The main advantage of 1-D interpolation is its lower computational complexity compared with 2-D interpolation. Moreover, we proposed an efficient classifier for 1-D patches to improve the quality of the reconstructed image. To confirm that our method is suitable for hardware, we implemented it on ASIC and FPGA. The experimental results show that, compared to [22], our LCDI-3 requires fewer hardware resources while the evaluated quality of the reconstructed image is similar. Compared to the hardware implementations of CNN-based methods [25], [26], [27], [28], our LCDI-7+ requires much fewer hardware resources without much loss in the quality of the reconstructed image. Note that the comparison with [27] is not provided in [24], but it leads to the same conclusion.

Here, we briefly introduce the studies [25], [26], [27], [28] on implementing CNN-based SR methods on hardware; most of them propose modified, lightweight models. Based on FSRCNN-s, deconvolution accelerators for SR methods are presented in [25] and [26]. Lee et al. [27] replace the deconvolution layer of FSRCNN-s with a sub-pixel layer, so that the cost of their weights is about 60% of that of FSRCNN-s while achieving almost the same performance. Kim et al. [28] designed a novel CNN-based SR model and its hardware implementation: Real-time SRCNN (RTSRCNN).

In [24], we proposed an efficient SR method and its hardware implementation, but we only developed a ×2 SR architecture, and the quantity of pre-learned weights was too large. Therefore, in this work, we develop ×3 and ×4 SR architectures based on LCDI-7+ and attempt to reduce the required pre-trained weights. The main contributions of the proposed method are as follows:

1) Based on LCDI-7+, we apply a fusion scheme to reduce the required pre-learned weights. The quantity of pre-trained weights of the proposed ×2 SR architecture is a quarter of that of LCDI-7+ without noticeably compromising performance.

2) The original classifier of LCDI-7+ is improved. Since the number of its classes is reduced from 2,025 to 625, the quantity of pre-learned weights is reduced by around two-thirds.

SI and our LCDI [24] are the base related works of this paper; they can be regarded as 2-D and 1-D interpolation, respectively. In both interpolations, the designated classifier is applied first, and then the corresponding pre-learned weights are obtained and used to compute the values of the pixels of the intermediate-resolution (IR) and HR images.

Based on the detailed analysis in [24], the critical reason why the computational complexity of 1-D interpolation is lower is the reuse of the pixels of the IR image. Returning to Fig. 2(b), the gray and orange spots denote pixels of the LR and HR images, respectively, and the yellow spots are regarded as pixels of the IR image. Since the pixels of the IR image are reused to generate the pixels of the HR image, the total number of operations is reduced.
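To make the reuse argument concrete, the following back-of-the-envelope count compares rough multiplication counts for ×2 up-scaling of an H×W LR image. It assumes that every output pixel is interpolated with a k-tap (or k×k) kernel and ignores border handling and copied pixels, so it illustrates the trend only and is not the exact operation count of LCDI-7+:

def multiplies_2d(H, W, k, scale=2):
    # direct 2-D interpolation: every HR pixel references a k x k neighbourhood
    return (scale * H) * (scale * W) * k * k

def multiplies_1d_two_pass(H, W, k, scale=2):
    # vertical pass: (scale*H) x W IR pixels, k multiplications each;
    # horizontal pass: (scale*H) x (scale*W) HR pixels, k multiplications each.
    # The IR pixels are computed once and then reused by the horizontal pass.
    return (scale * H) * W * k + (scale * H) * (scale * W) * k

# per LR pixel with a hypothetical 7-tap kernel: 196 vs. 42 multiplications
print(multiplies_2d(1, 1, 7), multiplies_1d_two_pass(1, 1, 7))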

LCDI-7+ has confirmed its low computational complexity, but only a ×2 SR architecture is provided. Therefore, we follow it and extend the architecture to ×3 and ×4 SR. Moreover, the quantity of pre-learned weights is too large, so we employ a fusion of pre-learned weights. Experimentally, our method requires 75% fewer pre-learned weights and is more cost-effective than LCDI-7+.

In this section, we elaborate on our method, including the fusion of pre-learned weights and the extended ×3 and ×4 SR architectures. Moreover, we describe how the fusion of weights is handled in the pre-training and interpolation processes, as well as the simplified classifier of the 1-D interpolation.

Since the quantity of pre-learned weights in LCDI is too large, we attempt to merge its four groups of pre-learned weights into one. An example of the ×2 SR architecture is shown in Fig. 4, including the vertical and horizontal interpolations. The labels #1, #2, #3, and #4 indicate the four groups of pre-learned weights in LCDI-7+ [24], which we attempt to fuse. For the ×2 SR method based on 1-D interpolation, one pixel of the LR image generates two pixels of the IR image, and one pixel of the IR image in turn yields two pixels of the HR image. Following the ×2 SR architecture, we develop the ×3 and ×4 SR architectures. As shown in Fig. 5, the proposed ×3 and ×4 architectures likewise consist of two stages. In the ×3 SR architecture, both the vertical and horizontal interpolations map one pixel to three pixels; in the ×4 SR architecture, both interpolations map one pixel to four pixels.
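As an illustration of one such stage, the sketch below up-scales a single 1-D line by a factor s, producing s output pixels per input pixel from a fixed-length reference patch and class-indexed weights. The 7-tap patch length is only suggested by the name LCDI-7+, and the border handling and all identifiers are illustrative assumptions rather than the exact LCDI-7+ procedure:

import numpy as np

def interp_1d_stage(line, weights, classify, scale, taps=7):
    """Up-scale a 1-D line of pixels by `scale` with class-indexed 1-D weights.

    weights[c] is assumed to hold `scale` weight vectors of length `taps` for
    texture class c (one vector per output phase); classify() maps a 1-D
    reference patch to its class index.
    """
    half = taps // 2
    padded = np.pad(np.asarray(line, dtype=np.float64), half, mode="edge")
    out = np.empty(len(line) * scale)
    for i in range(len(line)):
        patch = padded[i:i + taps]        # taps reference pixels around position i
        c = classify(patch)               # texture class of the 1-D patch
        for phase in range(scale):        # one output pixel per phase
            out[i * scale + phase] = weights[c][phase] @ patch
    return out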

The proposed fusion of pre-learned weights is employed for the ×3 and ×4 architectures as well. Since ×3 and ×4 SR architectures are not provided in LCDI-7+, we assume the condition that separate groups of pre-learned weights are used in the ×3 and ×4 SR architectures, as shown in Fig. 6. In the pre-training process, the weights are used to generate two outputs; therefore, 14 weights, which belong to two groups, are trained. With N categories, a total of 14 × N weights are pre-trained.

Finally, we briefly introduce the interpolation process. Following LCDI-7+, we employ two stages to up-scale the image, namely vertical and horizontal interpolation, as shown in Fig. 3. The main difference is that we do not require two different tables of weights and only access the same table in both stages. Likewise, we take the ×4 architecture as an example, which contains a two-stage computation. The vertical interpolation is given in (2) and Fig. 6(d).
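Putting these pieces together, a minimal sketch of the two-stage process with a single fused weight table is given below; for ×2 with N texture classes, the table would hold the 14·N pre-trained weights mentioned above (assumed here to be 2 output phases × 7 taps per class). The layout and identifiers are illustrative assumptions, not the exact data structures of the proposed design:

def upscale(lr, weights, classify, scale):
    """Two-stage up-scaling by `scale`: vertical, then horizontal 1-D interpolation.

    Both stages look up the same fused weight table (assumed shape:
    N classes x scale phases x taps) instead of separate weight groups
    per stage or direction.
    """
    # Stage 1: vertical interpolation of each column -> IR image of size (scale*H, W)
    ir = np.stack([interp_1d_stage(col, weights, classify, scale) for col in lr.T], axis=1)
    # Stage 2: horizontal interpolation of each IR row -> HR image of size (scale*H, scale*W)
    hr = np.stack([interp_1d_stage(row, weights, classify, scale) for row in ir], axis=0)
    return hr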

In this section, we discuss the classifier of the 1-D interpolation. In LCDI-7+, we adopt the granularity of the gradients of contiguous pixels to discriminate the property of the 1-D patches. As shown in Fig. 8(a), we compute the four gradients G0, G1, G2, and G3 of the contiguous pixels of the 1-D patch and then granularize each gradient value into different grades. Moreover, the required operations of this classifier are only four subtractions and some comparisons, which makes it hardware-friendly.
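A sketch of such a gradient-grade classifier, already using the simplified rule described in the next paragraph (five grades for every gradient, hence 5^4 = 625 classes), is shown below. The grade thresholds and the choice of the five contiguous pixels taken from the patch are illustrative assumptions; the actual grade rules are defined in Fig. 8:

def classify(patch, t1=8, t2=32):
    """Map a 1-D patch to one of 625 classes via four graded gradients."""
    mid = len(patch) // 2
    p = [int(v) for v in patch[mid - 2: mid + 3]]   # five contiguous reference pixels (assumed)
    grads = [p[k + 1] - p[k] for k in range(4)]     # G0, G1, G2, G3: four subtractions

    def grade(g):
        # granularize a signed gradient into one of five grades (illustrative thresholds)
        if g <= -t2: return 0
        if g <= -t1: return 1
        if g < t1:   return 2
        if g < t2:   return 3
        return 4

    g0, g1, g2, g3 = (grade(g) for g in grads)
    return ((g0 * 5 + g1) * 5 + g2) * 5 + g3        # class index in the range [0, 624]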

For LCDI-7+, G1 and G2 employ the rule of nine grades, as shown in Fig. 8(b), while G0 and G3 employ the rule of five grades, as shown in Fig. 8(c). The total number of classes is therefore 2,025 (= 5 × 9 × 9 × 5). However, this quantity is still too large, so we adopt the rule of five grades for all of G0, G1, G2, and G3, which reduces the total number of classes to 625 (= 5 × 5 × 5 × 5).

In this paper, we follow our previous work LCDI-7+, and the training data include the images of [30] and the high-resolution images named DIV2K [31].

As observed in FSRCNN [11], the image quality is only slightly better when trained on ImageNet than on the T91 and G100 images. Therefore, the T91 and G100 datasets contain enough variability of natural images to achieve nearly the same results as other, larger datasets.

The evaluation metrics include the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM). We compare with SI, HSI, OLM-SI [21], and LBPC-SI [23], which are regarded as the SI-based SR methods and our primary competitors. The two versions F and S of SR-LUT, which is also a learning-based interpolation, are considered as competitors as well. Among CNN-based SR methods, we choose the lightweight models FSRCNN-s [11] and [27]; based on the title of article [27], we abbreviate it as HED-SR in this section. Proposed A and Proposed B represent two versions with identical interpolation architecture but different 1-D classifiers and pre-learned weights.

Since many improved versions of Bicubic have been proposed, we simply assume a lower bound on its computational cost. Considering that 16 pixels must be referenced in Bicubic, we use a 1-D interpolation with four referenced pixels to evaluate its computational cost. From the results, the required computational cost of the proposed methods is close to that of Bicubic, but ours achieves a noticeable improvement in the quality of the SR image.

For the SI-based methods, which use 2-D interpolation, OLM-SI presents results for ×2, ×3, and ×4 SR, whereas SI, HSI, and LBPC-SI only provide the ×2 SR method. For the evaluated cost, the class numbers of the four methods are given in their original papers, but the number of weights of SI, OLM-SI, and LBPC-SI and the computational cost of the four methods are not stated clearly. We therefore estimate these values in TABLE 6, 7, and 8 from the descriptions in their papers. Compared with them, the proposed methods require a bit more storage and computational cost but obtain much better performance.

Next, SR-LUT is also denoted as a learning-based interpolation, but its architecture differs widely from SI. SR-LUT requires even fewer multiplications than Bicubic. However, considering the size of the look-up table, version F is much smaller than version S. In comparison with SR-LUT, both versions require somewhat fewer operations than the proposed methods; however, regardless of magnification, Proposed A and B have notable advantages in terms of performance.

Compared with CNN-based SR methods, we take FSRCNN-s and HED-SR, which are lightweight CNN-based SR methods, as the primary competitors. Implementing CNN-based SR methods on hardware is challenging, so most hardware studies of CNN-based SR methods [25], [26], [27], [28] implement lightweight models. Although we do not provide a hardware implementation in this paper, we compare the estimated number of operations as the computational cost. From this result, the proposed methods have a much lower computational cost than FSRCNN-s or HED-SR and lose only a little performance in PSNR. Besides, our previous work [24] has confirmed that this architecture is hardware-friendly and cost-effective compared to others.

Next, the comparison on actual images, which is regarded as the subjective evaluation, is presented in Fig. 9 and Fig. 10, where the proposed method is applied to the real application of image up-scaling or SR.

Another issue, the calculation of run-time, is discussed here. We measure the run-time using the "tic" and "toc" functions in MATLAB. After a program is compiled in MATLAB, the run-time of its first execution is higher than that of later executions; TABLE 9 presents the first, second, third, and fourth measured run-times for ×2 SR on the testing dataset Set5. From our observation, the run-time becomes stable from the third test, so we adopt the third measured run-time as the experimental result in TABLE 6, TABLE 7, and TABLE 8.
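For reference, the same warm-up behaviour can be reproduced outside MATLAB; the snippet below is a Python analogue (time.perf_counter in place of tic/toc) of running the program several times and keeping the third measurement once the run-time has stabilized. It is illustrative only and is not the script used to produce TABLE 9:

import time

def measure_runtime(fn, *args, runs=4, keep=2):
    """Run fn(*args) several times and return the (keep+1)-th run-time,
    mirroring the observation that the first executions are slower."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()                 # analogue of MATLAB's tic
        fn(*args)
        times.append(time.perf_counter() - start)   # analogue of toc
    return times[keep], times

# e.g. third_run, all_runs = measure_runtime(upscale, lr, weights, classify, 2)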

In this paper, we propose a hardware-friendly and cost-effective interpolation for multi-magnification (×2, ×3, and ×4) SR image reconstruction. The main benefits of our method are its low computational complexity and its competitive image quality. The proposed method exploits 1-D interpolation, which requires fewer operations than 2-D interpolation. Besides, to achieve finer image quality, we design an efficient 1-D classifier and use machine learning to obtain more optimal interpolation weights. Finally, we use the fusion concept to reduce the number of pre-trained weights. Experimentally, compared to the related work OLM-SI [21], which uses 2-D interpolation, the run-time and the quantity of pre-trained weights of the proposed method are on average 90% lower across the different magnifications. Compared with the recent work SR-LUT [19], there are pros and cons: the computational cost of the proposed SR method is slightly higher, but its performance is notably better. Compared to lightweight CNN-based SR methods, the proposed method has an even lower estimated computational cost with only a small loss of performance. Notably, implementing a CNN model on resource-limited hardware devices is a great challenge, whereas the proposed method is more hardware-friendly. In conclusion, the proposed method is a cost-effective and practical solution for resource-limited hardware devices.

The authors would like to thank Dr. Yunqiang (Charles) Chen from Qualcomm, the anonymous reviewers, and the editors for their valuable comments and suggestions to improve the quality of the paper.