SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for Halide

Filtering is a fundamental tool in image processing, and its acceleration affects many applications. Therefore, various algorithmic and hardware accelerations have been proposed for filtering. Recursive processing using infinite impulse response (IIR) filtering is an efficient algorithm, and various hardware acceleration methods have been applied to IIR filtering. In addition, a domain-specific language (DSL) of RecFilter was proposed to generate efficient IIR code for various hardware applications as an extension of image processing language, Halide. Recursive filters based on sliding discrete cosine transform (SDCT) have been the most efficient approximations in recent years. For hardware acceleration, parallelization of recursive filters is challenging. One of the most efficient methods is tile-based parallelization. However, even if a function is optimized and modularized, it is not sufficiently optimized for applications where various pre/post-processing steps are coupled before and after filtering. Additionally, multiplatform deployment requires reimplementation of the code. In this study, we extended Halide for SDCT convolutions to realize efficient computing of image processing applications with filtering, named SlidingConv. The experimental results showed that SlidingConv is faster than the hand-tuned CPU code and 1/1900 of the hand-tuned code length, running more efficiently than de facto libraries like OpenCV. To verify its efficiency, we deployed the code on various hardware (x86/64 CPU with AVX2/AVX-512, ARM CPU, and GPU). In addition, we verified that the proposed method can accelerate image processing with pre/post-processing for filtering. Our code is available at https://fukushimalab.github.io/SlidingConv/.


I. INTRODUCTION
Spatial filtering is an essential image processing tool in computer graphics, computational photography, and low-level vision applications.Accelerating spatial filters is important because they are used in various applications.The computational cost depends on the filtering window size.

A representation of 2D spatial filtering is finite impulse
The associate editor coordinating the review of this manuscript and approving it for publication was Andrea F. Abate .response (FIR) filtering, whose computational order is O(R 2 ), where R is the radius of the convolutional kernel.Speeding up the filtering requires algorithmic and hardware acceleration.If a kernel is separable, then the convolution can be computed in O(R) by decomposing it into two 1D kernels.In addition, the fast Fourier transform (FFT) reduces the convolution order to O(log N ), where N is the number of signal samples.However, the number of scans of the entire image increases for these accelerations, reducing the cache efficiency (two for separable filtering and three for FFT).These algorithms have the following optimized libraries: OpenCV, Intel Integrated Performance Primitive (IPP), AMD ROCm Performance Primitives (RPP), NVIDIA Performance Primitives (NPP), FFTW, cuFFT, and clFFT.In addition, a domain-specific language (DSL) [1] for image processing, called Halide [2], can deploy optimized binaries for various hardware.The Halide language can separate descriptions of the algorithm (image processing) and scheduling (loop reordering, parallelization, and vectorization).This separation enables speed improvements and various hardware deployments with a simple description.
Infinite impulse response (IIR) filtering [3], [4], [5], [6], [7], [8] can reduce more orders.The IIR filter is realized by recursive processing, which allows it to be computed at a constant time of O(K ) per pixel, where K is the approximation order (K ≪ R in most cases) and is independent of the window size.The 1D IIR filter requires two pass filters, casual and anti-casual directions, and is realized by two IIR systems: parallel and serial systems.The parallel system is the sum of the causal and anticausal results, independently adding forward and reverse filters [3], [4], [5].The serial system is a product of causal and anticausal results: forward convoluting an input and then reverse convoluting the forward result in a cascade manner [6], [7], [8].In the 2D or multidimensional signal cases, 1D filters are applied as a cascade 1D filter in a separable manner.The IIR filter was implemented in Insight Toolkit (ITK) [9], developed mainly for medical image processing libraries.
Tile parallelization code is complex.In addition, convolutions are chained to various additional pre/post-processing steps.A simple example is unsharp masking, which considers the difference between the original image and convolutional image and then adds it to the original image.Furthermore, floating-point operations on 8-bit images require casting as additional pre/post-processing steps.This pre/ post-processing reloads the data (i.e., scans the entire image again before and after the filters), reducing the cache efficiency.Loop fusion for pre/post-processing and filtering loops improves cache efficiency; however, implementing loop fusion codes requires rewriting a large amount of code.Moreover, such code-writing cannot be modulated as a function or class, resulting in low portability.
RecFilter [19] was proposed to resolve this modulation issue.RecFilter allows the tile-by-tile parallelization of IIR filters by extending Halide.Recursive filtering requires help with separated descriptions of algorithm and scheduling because Halide mainly targets stencil computing, whereas recursive filtering involves scan computing.Recursive filtering requires initialization processing for each tile, which interleaves the algorithm and scheduling descriptions.RecFilter solves this problem by generating Halide codes.In particular, RecFilter efficiently realizes IIR serial systems where all filters are cascaded with a specific initialization for per-tile parallelization.Moreover, IIR filters can be decomposed into parallel and serial systems to represent the approximate order as a combination of lower-order filters [22], and RecFilter allows constructing filters of appropriate order with parallel systems.
As a recent algorithmic improvement, sliding discrete cosine transform (SDCT) filtering has realized an arbitrary FIR filter as a recursive filter [23], [24], [25], [26], [27], [28], [29].SDCT can represent FIR filtering as short-time DCTs and compute them as recursive filtering.The computational order is O(K ), as in IIR filters.SDCT filtering can be computed with finite taps instead of an infinite initial length owing to the FIR property.Moreover, the filter can be implemented using a forward filter only (i.e., the number of processes is halved from IIR without an anti-causal filter).Therefore, SDCT filtering requires fewer image scans, and its initialization is easier than IIR filtering.In addition, any even function can be easily designed into a convolution with FIR weights as input, whereas IIR filtering is difficult to design arbitrary weights.
Code complexity in per-tile parallelization for SDCT filtering is inherited from recursive IIR filtering.As an example of increased complexity, the image quality evaluation index, structural similarity (SSIM) [30], which uses an internal Gaussian filter, has been accelerated using an SDCT filter [31], and another [32] which uses a constant-time bilateral filter with SDCT filter and tiling to improve efficiency.However, these applications require considerable code rewriting to realize and deploy applications in various architectures.
Therefore, we propose a domain-specific description for SDCT filtering, named SlidingConv.SlidingConv decouples the scheduling and algorithm descriptions for SDCT filtering, generating a high-performance code with fewer descriptions.Fig. 1 shows a visual overview of Sliding-Conv.SlidingConv generates scheduling-aware Halide codes from pre/post-processing, user-defined kernel, scheduling, and SDCT algorithm descriptions.The generated code is compiled for deployment in various architectures.Halide is not a Turing-complete language; thus, Halide cannot always describe schedules and algorithms separately.Slid-ingConv solves this contamination problem by acting as a Halide code generator.It also functions as a DSL, inheriting Halide's characteristics as a scheduling descriptive language.Similarly, SlidingConv can be embedded in Halide codes.
This behavior is similar to RecFilter; however, the three problems cannot be solved directly using the conventional RecFilter.First, RecFilter cannot realize recursive filtering in a parallel system.Second, RecFilter initialization is specific to IIR.Third, SDCT filtering uses different data for each approximation order of the parallel system in the recursion, while RecFilter does not require this.Moreover, regarding RecFilter's functionality, only pre-processing can be described with filtering (i.e., no kernel fusion in post-processing [33]).
The contributions of this paper are as follows: • We propose the first DSL description for SDCT filtering.We support the parallel system of recursive filters, specialize in initialization, and extend the recursive reference data.
• We support loop and kernel fusion for pre and postprocessing.The functionality enables cache-efficient descriptions for image processing that use filtering.
• We enable various CPU and GPU backends.Sliding-Conv is the first implementation of the SDCT filter on GPUs, ARM CPUs, and AVX-512 CPUs.Since SlidingConv outputs Halide codes, it is easily adaptable to various backends but provides unbreakable manual scheduling for the SDCT filter.Table 1 summarizes the algorithms and implementations of convolutional filters.

II. PRELIMINARIES A. SLIDING-DCT FILTERING
This section presents the details of SDCT filtering [28].Here, we present one-dimensional (1D kernel) cases because the   (f * g)(x) ← K k=0 Z k (x) 21: end for cascaded processing of 1D kernels can realize multidimensional kernels with separability.Algorithm 1 provides an overview of SDCT filtering flow.  of FIR filtering (f * g)(x) is defined as follows: ( Here, we consider (1) in the DCT domain.The cosine is an even function; thus, we considered only even functional kernels.The weights g(n) can be represented by the following inverse DCT from the DCT coefficients where the variables T (or φ = 2π T ), k 0 , and n 0 depend on their type of DCT.DCT can be divided into eight types depending on how the basis, signal, and period are treated as discrete signals for continuous signals, as summarized in Table 2.The coefficients in the matrix form G = [G 0 , G 1 , . . ., G R ] T ∈ R 1×R+1 can be represented using least squares [28] as follows: where g = [g(0), . . ., is the cosine kernels and W = diag( 1 2 , 1, . . ., 1) ∈ R R+1×R+1 is the weight for dealing with the kernel symmetry.
Convolution (1) can be represented using (2): where F k (x) ∈ R.This expression is O(R 2 ) for a 1D convolution whose complexity is higher than that of the original convolution of (1), which is O(R).
The sliding transform can significantly reduce the order of the increase.This transform utilizes the second-order shift property, which is a relational expression of three short-time transform coefficients for each k: F k (x − 1), F k (x), and F k (x + 1).The relationship is defined as where the function k (x) ∈ R is defined by the multiply-add of the input f and coefficients C k : k (x) function can be simplified by extending the terms cos and sin.We can use the following relationships.
• DCT-I: x for each DCT type introduced in [28].Furthermore, premultiplying F k (x) by G k on both sides of ( 6) simplifies (4). where The DCT closely approximates the response with fewer coefficients; thus, we can truncate the convolution by a limited value, K ≪ R. K ∈ N is the approximation order.The approximate definition is as follows: As Eq. ( 6) can be updated with two multiplications for each component, the filter of (10) can be approximated by O(K ), which does not depend on the size of the kernel.We focused on odd types (DCT-I, III, V, and VII) for the usual odd-size filtering in this paper.The even types of DCT-II, IV, VI, and VIII are for an even kernel size (2,4,6,. . .), used for upsampling and downsampling because the axis is n 0 = 1 2 .

2) DIRECT CURRENT SPECIALIZATION FOR DCT
SDCT can be specialized when k = 0 because DCT-I and DCT-V are k 0 = 0 and C 0 (n) = cos(0) = 1 according to Tab. 2, called a component of direct current (DC).
7566 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
As C 0 (n) = cos(0) = 1, the DC of F 0 (x) can be represented as This relationship is similar to the box filter; thus, it is simple.Thus, (8) can be specialized when k = 0 as Using (12), Z 0 (x) can be computed with fewer operations and higher accuracy than using (8) because there is no round error in floating-point evaluation when multiplying G k and C k (n).Furthermore, because K is usually small, this specialization is crucial for speedup factors.

3) OPTIMIZATION OF RADIUS
Given the order K and weight g, it improves filtering accuracy to optimize the convolution radius R for matching the given parameters [26].This section introduces the optimization approach.
As the FIR filter terminates weights up to a radius R, we should consider the reproduction accuracy of the effective range and ignore the weights of the outer long-tailed portions.The former is the integral of the squared error of the kernel weight g(t) and DCT approximated weight g(t), and this definition is as follows: where η is the integral of the weight for normalization (e.g., √ 2π σ in the Gaussian kernel).The later of the squared errors of the outer kernel function is as follows: Then, we minimized the total error to find the optimal R using a linear search, binary search, etc.
arg min This optimization does not affect the filtering speed because it can be calculated before filtering.

B. HALIDE
Halide [2] is a major DSL embedded in C++ for high-performance image processing.It describes the code in the algorithm and scheduling parts separately.The algorithm part includes essential processing and can be described as hardware-independent parts.The scheduling part includes performance tuning of processing and can be described for each hardware architecture.Based on these two parts, a Halide compiler can automatically optimize the code's performance and deploy it to various hardware.Halide is updated continuously and officially [34], [35], [36], [37].Program 1 presents an example of Halide 3 × 3 box filtering, and Table 4 lists the main classes and methods for Halide scheduling.The Func class is the body of functions, which defines functions using dimensional variables.In the actual pipeline, pure definitions are executed, and updated definitions are executed in the order they are defined.In the update, the functions can be defined using reduction domains.In the algorithm part, the input image is padded to prevent external references to the input image, as described by BoundaryConditions::repeat_edge, and then separable 3 × 1 filtering is performed.In the scheduling part, blur_y is split into 32 × 32 tiles, and variables x and y are split into inner and outer variables.The inner variable xi in x is then vectorized by a width of 8, and the outer variable yo in y is parallelized.blur_x's computation timing is set to compute_at(blur_y, xo) that only the necessary pixels are computed and stored within the xo-loop of blur_y.Vectorization is also performed for xi with a width of 8.

C. RECFILTER
RecFilter [19] is a Halide extension for cascaded IIR filtering that internally generates Halide codes to overcome Halide's limitations in recursive filtering.Program 2 shows an example code of RecFilter for IIR Gaussian filtering.RecFilter is a class body, RecFilterDim is a dimension variable in RecFilter.The method add_filter adds IIR filtering, which has the first argument for setting the filtering direction and the second for coefficients.For example, add_filter(+x, {a, b, c}) is the definition of an IIR filter, F(x) = af (x)+bF(x −1)+cF(x −2).The function gaussian_weights returns the coefficients for IIR Gaussian Filtering.The split method converts a tile dimension into a specific width, similar to Halide scheduling, but specialized for IIR filtering.Then, cpu_auto_schedule sets an appropriate schedule with a specific vectorization width from set_vectorization_width for RecFilter.

III. LIMITATIONS OF RECFILTER FOR SLIDING-DCT-BASED FILTERING
We clarify the limitations of RecFilter for IIR filtering by trying to realize SDCT filtering of Algorithm 1 by RecFilter.Program 3 presents the pseudo-RecFilter code for the SDCT filter.Lines 4-6 in Program 3 represent lines 3-4 in Algorithm 1, representing the initialized convolutions of the first and second pixels.In addition, Algorithm 1's line 13 or 15 is implemented as a recursive process using RecFilter's add_filter, where the array of the second argument contains the feedforward coefficients in the first element and the feedback coefficients in the last elements.The second and third elements are 2C k (1) and −1, respectively, because the second and third terms are 2C k (1)Z k (x − 1) and −Z k (x − 2).The first term is G k k (x − 1), which changes depending on the pixel.The first array element is set to 1, and the value of G k k (x − 1) is assigned to the expression of RecFilter.Its description is add_filter(+x, {1, 2 * C 1 [k], -1}).In line 10 of Program 3, select assigns values equivalent to G k k (x − 1) to pixels other than 0 and 1 in RecFilter.However, Program 3 has three problems: 1) loop splitting, 2) no starting point, and 3) fixed-input image problems.
First, RecFilter runs one image scan for each recursive filter.In the image-scanning loop of Algorithm 1's line 11, there are K +1 updates for Z k per pixel, but the processing can be realized by one loop.The existing RecFilter runs the image scanning loop per filter K +1 times, added by add_filter to the for-loop (i.e., K + 1 times image scanning loops), such as loop fission.This redundant loop structure causes much of the cache miss.In addition, the filters added by add_filter are executed in added order in a cascaded manner, and the second and subsequent filters use the result of the previous filter as an input image.Therefore, convolution using SDCT cannot be realized when multiple filters are processed independently of the input.
Second, RecFilter cannot specify the start pixel of the filter because RecFilter is for IIR filtering, which considers an entire image as an argument.Sliding-DCT-based filtering is an FIR filter that requires a finite-size kernel to be convolved with the pixel.In Algorithm 1's line 2, the first two elements are convolved with the initial values instead of the update processing for the sliding transformation.Whereas IIR filtering does not require the specification of the start pixel, RecFilter does not have the specification functionality.
Third, RecFilter cannot change the input images in the internal RecFilter class.IIR filtering refers to a pixel in the input and the computed images for a newly computed output.SDCT filtering uses the input image in k (x) computation, which can be used for input like IIR filtering.However, k (x) must be specified for each recursive filter because k (x) varies with k in the update process in Algorithm 1's line 13.This requirement interferes with its realization in RecFilter FIGURE 2. The difference between tile scheduling for recursive filtering by Halide and SlidingConv.The first two pixels are initialization calculations, and after that, sliding updates are performed; in the case of Halide, initialization cannot be inserted; thus, the neighboring tiles must also be calculated, while the proposed method completes calculations on a per-tile basis.because RecFilter enforces one image as the filtering input for each cascaded filter.
Because of these three limitations, the existing RecFilter cannot realize SDCT filtering of Algorithm 1.

IV. PROPOSED METHOD
The native Halide code performs massive redundant computations for SDCT filtering with tiling parallelization because each tile requires the results of the previous tile.Therefore, recomputations of the previous tiles are required.The tiling coordinates in the algorithm part must be hardcoded to avoid this condition.SlidingConv can generate such schedule-aware codes using a code that decouples the algorithm from scheduling.Fig. 2 shows the computational scheduling of Halide for SDCT filtering and that of the proposed SlidingConv.
In addition, SlidingConv solves the limitations of Rec-Filter.Furthermore, SlidingConv allows pre/post-processing of filters to simultaneously, which realizes cache-efficient implementations and provides the functionality of kernel settings in SDCT filters.

A. SOLUTIONS FOR THE RECFILTER'S LIMITATIONS 1) LOOP SPLITTING
For the loop splitting problem, we changed RecFilter to handle parallel systems in a single-loop structure.We newly create as many Func objects as the number of filters inside a RecFilter instance to merge the loops using the compute_with method, and each Func independently refers to the input of RecFilter.The code should specify update(0) for scheduling because RecFilter uses RDom to create the filter loop.Program 4 shows this modification.

2) NO-STARTING-POINT PROBLEM
To address the no-starting-point problem, we added a method for specifying the initial values necessary to start the update process.We added set_init_func method to allow this functionality.When the filter added by the add_filter method does not have the elements necessary to perform the update process, it refers to the Func specified by this method.To specify the initial values for each filter, the same number of Func as the number of times add_filter passes to set_init_func.The first two elements of signals should be specified by the specialized Func for initial convolutions, named initFunc, allowing RecFilter to start updating the sliding transform from the third element.Inside the internal code of the extended RecFilter (Program 5), the select method with the starting point condition switches between the initial value specified by set_init_func and the updated definition to define the starting point.Program 6 shows the definition of initFunc.

3) FIXED-INPUT IMAGE PROBLEM
We added a method for specifying the reference pixels per filter for the fixed-input image problem.We added the new method, set_delta_func, to allow RecFilter to have this functionality.The filter added by add_filter refers to the Func specified by set_delta_func.This allows RecFilter to specify k (x) that varies with each k and to refer to a different value for each filter.Program 6 shows the definition of set_delta_func.

B. SLIDINGCONV
SlidingConv is implemented in a class in Halide language, defined SlidingConv.This is an inherited version of the above improvements to the limitations of RecFilter.SDCT filtering realizes an FIR filter that differs from the IIR filter realized by RecFilter.FIR filters can consider the convolution results more easily than IIR filters; however, they require accurate parameters for the recursion process.SlidingConv also provides transformation and optimization functions for this purpose.Image processing rarely ends with a filter alone but often runs some processes before and after the filter.SlidingConv also allows combining these processes to generate more efficient codes.Table 5 shows the list of methods in SlidingConv.These functionalities are introduced step by step.
Program 7 shows an example code for unsharp masking that uses all the additional functionalities of SlidingConv.The unsharp masking is defined as follows: SlidingConv is the body for SDCT filtering.Sliding-Conv's functionalities are described according to Program 7.

1) PRE-PROCESSING
First, we initialized the pixel values in SlidingConv by a given Func or Buffer object.Func can contain any image processing; thus, it can write arbitrary pre-processing.The pre-processing is within the loop of the SDCT filtering by default (i.e., no loop splitting).If we plan the pre-processing separately, these processes run individually.In Program 7, pre-processing is converting the data type from unsigned char to float.No pre-processing is required if we initialize SlidingConv by the SlidingConv input.

2) ALGORITHM SETTING
Next, we specified an algorithm for SlidingConv.The set_kernel method sets the kernel weight defined by the std::function, and set_radius sets its radius.Program 8 is an example of the kernel weight as a Gaussian weight (σ = 3).However, this std::function does not limit the kernel weight to Gaussian, and an arbitrary even function is available.The weight is described by the type std::function<double(int, double, int)>, whose arguments are maxRadius (window radius), offset (n 0 of DCT), and r (radius of required value).If r = 0, the code executes the automatic radius determination described in Sec.II-A3.The computational cost of this description does not affect the convolution itself because it is called before filtering, and the cost is lower than that of filtering.
In addition, set_algorithm sets the DCT type of SDCT, and set_order sets its approximation order.The filtering time and accuracy can also be controlled by specifying the radius, algorithm, and order.These parameters define the characteristics of SDCT filtering.

3) POST-PROCESSING
We can write arbitrary post-processing, running within the SDCT convolution loop (i.e., no loop splitting) with the set_post_process_func method.In Program 7, the process corresponding to (17) is performed as a post-processing step.If we do not call the setting function, we can omit post-processing.The writing form is different from that of pre-processing.If SlidingConv can accept the Func output, we can write the following form, which is similar to pre-processing; Func(x, y) = SlidingConv(x, y)+2 * SlidingConv(x, y).However, the current implementation has not supported this form due to the limitations of Halide in the language description.
Additional pre/post-processing is limited to point (pixelwise) operators.Polyhedral optimization must be considered [38], [39] for extending to stencil or areawise operations.

4) SCHEDULING
Finally, we set schedules for SlidingConv.SlidingConv has the same functionalities of Halide listed in Table 4. Additionally, we provided two scheduling scope methods: scheduleX (for x-direction) and scheduleY (for y-direction).We cascaded two SDCT filters in the x-and y-directions; therefore, SlidingConv should be scheduled in each direction.We provided the cpu_auto_schedule method for default scheduling instead of manual scheduling.In addition, GPU scheduling for the OpenCL or CUDA backend is possible using gpu_thread and gpu_block methods for manual setting, or gpu_auto_schedule for the default schedule.The scheduling methods described in Program 7 are equivalent to that in cpu_auto_schedule.We provided tile(width/2,16) for X and tile(16,height/2) for Y because they have good speeds for any kernel, order, image size and CPU.

V. EXPERIMENTAL RESULTS
We evaluated SlidingConv based on performance, functionality, and descriptivity and made 9 experiments.
For performance evaluation, the first experiment compared the proposed method with the hund-tuned SIMD code for sliding DCT algorithms1 (Sec.V-A1).The second experiment evaluated the filters of various algorithms and implementations based on the computational time and accuracy (Sec.V-A2).The third experiment verified the effectiveness of the proposed method for various architectures by justifying its speed (Sec.V-A3).
For functionality evaluation, we conducted four experiments.The first is the scheduling test for the CPU and GPU (Sec.V-B1), which verifies the parameter of the tiling parallelization.The second is the effectiveness of the preand post-filtering features (Sec.V-B2), justifying the performance enhancement of the image-processing chain.The third is the DC optimization (Sec.V-B3), which examines the effectiveness of DC specialization.The fourth is radius optimization (Sec.V-B4), demonstrating the effectiveness of the radius setting.
For descriptivity evaluation, we evaluate two aspects: description efficiency and generated Halide code analysis.We verified the description efficiency (Sec.V-C1) and analyzed SlidingConv's intermediate code using the generated Halide codes (Sec.V-C2) and loop structures (Sec.V-C3).
We set the default schedule as cpu_auto_schedule or gpu_auto_schedule and the default algorithm as DCT-V.We automatically calculated the radius by setting it to 0 based on (16).Table 6 lists the computers used.CPU1 and GPU1 were the default computers.The codes used are listed in Table 7.The compiler used was Microsoft Visual Studio Professional 2019 for the CPUs, except for the ARM CPU (using gcc).

A. PERFORMANCE 1) COMPARISON WITH HAND-TUNED CODE ON X86/64 CPU
We compared SlidingConv with a hand-tuned AVX SIMD implementation (over 20,000 lines of code) [21] using x86/64 CPUs.We conducted this experiment to show how closely the performance of the proposed method approaches that of the hand-tuned code.
Fig. 3a shows the speed with respect to the approximation order for each type of DCT.SlidingConv was faster than SIMD for all DCT types and orders; however, the difference was smaller from an order of approximately 8. SlidingConv's optimization can accelerate it faster than SIMD, but SlidingConv natively stores more data per order than SIMD.SDCT filtering requires the last two computed results per order for a sliding update of (8), and the older output can be discarded.Ling buffering achieves an effective implementation; however, Halide's specifications do not allow this implementation.Halide only allows the storage of calculation results in loops; it is impossible to keep only the previous result in the loop in the ring buffer.Therefore, SlidingConv, a DSL for Halide language, uses more CPU cache space per order than SIMD and increases the frequency of cache misses with increased orders.
Fig. 3b shows the filtering speed with respect to σ for each DCT type.For all types and orders of DCT, SlidingConv is faster than SIMD but is more affected by σ than SIMD because SlidingConv is more parallelized by tiles.Each tile requires the initial processing of O(r) convolutions for each order and direction at the start of two pixels; thus, a larger σ (i.e., radius) has a greater impact than smaller SIMD tile cases.The y-scale is quite small, with almost constant-time filtering for each σ .
Fig. 3c presents the approximation accuracy for each order.The accuracy depends on the DCT type, but it is noteworthy that it exceeds 80 dB for approximation order 3.For an 8-bit image, 60 dB is sufficient to match the accuracy of no-approximation [21].
Fig. 4 shows the speed with respect to σ for each image size.SlidingConv is faster than SIMD when the image size is small but slower than SIMD as the image size increases.SlidingConv stores more data per loop because Halide can only store the calculation results per loop unit.SlidingConv uses more cache space per loop than SIMD, and the frequency of cache misses increases with image size.

2) TRADE-OFF BETWEEN APPROXIMATION ACCURACY AND SPEED
We compared the performance evaluation of Gaussian convolution with various algorithms and their implementations.We computed the actual Gaussian convolution responses using the entire image (i.e., for large kernel sizes).Therefore, they are approximated using an IIR or SDCT filter, or the convolution radius is narrowed to a smaller size to speed up the process.Here, we evaluated various implementations regarding the trade-off between speed and approximation on CPU and GPU.We used the convolution result with a double precision of 6σ as the correct solution and evaluated its approximation accuracy using the peak signalto-noise ratio (PSNR).
a: CPU Fig. 5a shows the speed to PSNR with varying approximation orders or radius on Core-i5 10400 CPU.We used 7 methods: SlidingConv, SIMD (AVX) implementation, sepFilter2D-OpenCV's FIR filtering, ITK-class RecursiveGaussian-ImageFilter in Insight Toolkit (ITK) for IIR filtering, RecFilter [19], RecFilter VYV [20]-an improved RecFilter representation for the Vliet-Young-Verbeek IIR filtering [7], and Halide implementation for separable filtering.Sliding-Conv is the fastest and has sufficient accuracy because it is in the top-left corner, and SIMD has the nearest performance.The conventional DSL of RecFilter for IIR filtering has low performance (outside the plot) because the implementation is mainly for GPU.The improved RecFilter [20], which is for the CPU, has a high-speed performance; however, its speed is slower than that of SlidingConv, and its accuracy is low.

VOLUME 12, 2024
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.These three implementations work more efficiently than the de facto libraries (OpenCV and ITK).
b: GPU Fig. 5b shows the RTX3060 GPU case.We used 6 implementations: Sliding Conv, gpufilter [15], [17] with alg5f4 option-one of the fastest GPU IIR filtering, OpenCV's sepFilter2D on OpenCL, ITK VYV-ITK's GPU IIR implementation, RecFilter, and Halide implementation for separable filtering.The difference from the CPU experiment is that we changed the plots of SIMD to those of gpufilter and erased those of the improved RecFilter (VYV).Because there was no GPU implementation of the SDCT, we used gpufilter instead.The gpufilter library is a highly efficient convolutional implementation.We omitted the improved RecFilter because it was not created for GPUs.
SlidingConv and gpufilter were faster than the de facto libraries, and SlidingConv was the fastest with sufficient accuracy.The proposed method is faster, and its plots stick to the left edge.

3) DEPLOYMENT ON VARIOUS ARCHITECTURES
This experiment provides an example of the proposed method's effectiveness even when there are insufficient optimization codes for a particular computer architecture.

This is because SlidingConv can optimize codes for various architectures. a: ARM
The SDCT hand-tuned code is available only for AVX [21]; therefore, the code must be rewritten if the CPU architecture is significantly different.For example, the vector operations must be rewritten for NEON in ARM CPUs.Otherwise, we must use an unoptimized portable C++ code.Sliding-Conv avoids these problems without changing the code.Fig. 6a shows the filtering speed on M1 Pro (ARM64) and Core i5-10400 (x86/64).We compared the following four implementations: SlidingConv (Intel), SlidingConv (ARM), CPP (Intel), and CPP (ARM).The C++ code of SDCT was not optimized (i.e., without parallelization and vectorization on x86/64).
CPP (ARM) and CPP (Intel) were significantly lower than SlidingConv.SlidingConv (ARM) is as fast as SlidingConv (Intel); SlidingConv is fast regardless of CPU architecture.
SlidingConv is as fast as SIMD on AVX CPUs and faster than SIMD on AVX-512 CPUs.SlidingConv can generate code optimized for AVX-512 through scheduling method, whereas SIMD only executes AVX instructions.SlidingConv on i7-7800X is approximately 1.9 ms faster and that on i9-11980 is approximately 0.6 ms faster than SIMD.This is the difference in the number of FMA units, with a larger number resulting in more effective optimization.

c: GPU:
In the GPU case, a complete code refresh is usually required, whereas SlidingConv can perform this task only by changing the schedule.Fig. 6c shows the speed of Gaussian filtering on various GPUs: RTX2060 Super (Turing), RTX3060 (Ampere) and RTX4090 (Ada Lovelace).
SlidingConv is fast for all GPU architectures, and the filtering speed is proportional to the GPU performance.The gpufilter had relatively slow results for RTX3060 compared to the others, whereas it had a higher FLOPS than RTX2060 Super.SlidingConv can generate code optimized for various GPU architectures, whereas the gpufilter is optimized for the Turing architecture.

B. FUNCTIONALITY 1) SCHEDULING OF SLIDINGCONV
This section examines the impact of changing the tiling width, which primarily influences SlidingConv scheduling.
a: CPU Fig. 7a shows the speed-to-tile width, which can be set using the SlidingConv scheduling method; the other scheduling is the same as cpu_auto_schedule).A tile width 128 was the fastest on i5-10400, whereas 64 was the fastest on i9-9900K and i9-9980XE.This indicated that the optimal tile width depends on the CPUs (clock speed, threads, cache space, etc.) and that different scheduling is required for each CPU to utilize its performance effectively.Therefore, we provide scheduling features for SlidingConv.
b: GPU Fig. 7b shows various GPUs' results of the filtering speed for changing the tile width scheduling, fixing the other schedule to be the same as gpu_auto_schedule.For RTX2060 Super and RTX3060, a tile width of 32 was the fastest, while for RTX4090, it was 16.This indicated that the optimal tile width also depends on the GPU, and different scheduling is required for each GPU.

2) PRE/POST-PROCESSING
When adding pre/post-processing to SDCT filtering, the processing must be embedded in the filter code for high efficiency, which requires significant code modification.Fig. 8a shows the speed of unsharp masking, which requires pre-and post-processing.We compared five methods: Slid-ingConv (Gaussian), Gaussian filtering only by SlidingConv; SlidingConv (Unsharp), Unsharp masking by SlidingConv with pre/post-processing features; SlidingConv (Unsharp 7574 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.SlidingConv (Unsharp) is as fast as SlidingConv (Gaussian); thus, there is little overhead for the pre/post-processing features of SlidingConv.Additionally, SIMD (Unsharp) is significantly lower than SIMD (Gaussian), and SlidingConv (computeroot) is also lower than SlidingConv (Unsharp); it is important to write filtering in a single loop.Therefore, SlidingConv is efficient for writing image processing using SDCT filtering.Figs.8b, 8c, 8d show the resulting images.The PSNR between the OpenCV and SlidingConv outputs is 76.66 dB, invisible to the human eye.

3) DIRECT CURRENT SPECIALIZATION
This section verifies the effectiveness of the specialization of DC components.Fig. 9 shows the computational time and accuracy of Gaussian filtering for each approximation order with and without DC specialization.DCT-I/V with DC specialization uses the update equaition (12) for k = 0 while DCT-I/V without DC uses equaition (8) for k = 0. Fig. 9a shows DCT-I/V with DC is approximately one order of magnitude faster than without DC for all DCT types and orders.Therefore, DC specialization is crucial for the speedup factor.
Fig. 9b shows that filtering with DC has higher accuracy than without DC for all DCT and orders because there is no round error in the floating-point arithmetic in (12).Therefore, direct current specialization is crucial for accuracy.

4) OPTIMIZATION OF RADIUS
This section shows the effectiveness of radius size optimization.Fig. 10 shows the accuracy of Gaussian filtering with DCT-V for each approximation order.For R = opt, we optimized the radius using (16).For R = 6σ , we set the radius to 6σ , sufficient for the typical FIR convolution.The R = opt case has approximately 20 dB higher accuracy than the R = 6σ case for all orders.Thus, an appropriate radius setting impacts the filtering accuracy.

C. DESCRIPTIVITY 1) DESCRIPTIVE EFFICIENCY VS SIMD AND HALIDE
We evaluated the efficiency of the code description based on the number of words and code lines.We used the new word  counting metrics for the Halide code, and a word counting example is shown in Program 9. Table 8 lists word counts of SlidingConv and the Halide code generated for the code in Sec.V-A1.We excluded the kernel descriptions from the contains only 36 words, whereas the generated Halide contains 1352 (×37 words).Furthermore, the code (Program 10, discussed later) includes the scheduling part of RDom in the algorithm part.Writing the code is complex, contradicting Halide's advantage of describing the algorithm and scheduling parts separately.
Additionally, we compared the line counts of SlidingConv and manually optimized the SIMD code.Line counting did not include blank lines or comments.SlidingConv has only 11 lines, whereas SIMD has 21229 lines (×1900).SIMD implementation is complex, whereas SlidingConv is simple and highly readable.In addition, if we implement unsharp masking, as shown in Program 7 in SIMD, a tremendous amount of code modification is required; however, we can realize this by adding only a few lines with SlidingConv.Additionally, code optimization can be easily realized through scheduling.We cannot show the SIMD results of loop fusing for unsharp masking because it is difficult to reimplement more than 20,000 lines.

2) GENERATED HALIDE CODE
Program 10 shows the generated Halide code of Program 7 (x-direction without scheduling code).The actual volume would be approximately ×4 larger since we have additional codes for scheduling, y-direction, std::function kernel settings, etc.
Compared with Program 7, it is more complex because it describes a mixture of scheduling (tiling) and an algorithm (initialization process).This is unavoidable when writing recursive filters in Halide.Therefore, we propose a DSL for SDCT filtering.For example, in Program 10 line 4: xo = (x / 256) and let t636 = ((256*xo) + (rxi − 0)), the algorithm part of the generated Halide code is a scheduling-aware code that considers scheduling by splitting in the filter direction.We introduced a constant value of 256 through scheduling to split the image width.When Halide writes recursive filtering, it requires RDom descriptions for the recursion in the algorithm part.To split the recursive filter in this direction, we should change the extent of RDom.This description implies that the scheduling of the recursive filter must be considered when writing the algorithm in Halide.

3) GENERATED LOOP STRUCTURE
Program 11 shows the scheduling loop generated for Program 7 (x-direction).In Program 11's lines 24-27, recursive filtering can be processed in a single-pixel loop because the update functions of the delta updating for each k(∈ {1, 2, 3}) are in one loop.
Halide's autoscheduler finds the optimal scheduling from the descriptions of the algorithm-only part; however, Halide cannot find it in recursive filtering because its algorithm part is a scheduling-aware code.For example, when x-directional filtering in a recursive process is vectorized in the x-direction, it is necessary to compute each pixel individually, making it impossible to vectorize them in the filter direction and compute them together, which introduces an incorrect output.This is a limitation of Halide's autoschedulers.Therefore, SlidingConv supports manual scheduling and provides default scheduling via the cpu_auto_schedule method.

VI. RELATED WORK A. PARALLEL RECURSIVE FILTERING
Recursive processing is a typical technique for accelerating filtering.An example is the summed area table (SAT) [40] for box filtering in image processing.In computer vision, this is called an integral image [41].SAT is a particular case of a first-order recursive filter.SAT can realize two types of constant-time approximations of FIR filters: repeating box filters in serial systems [42], [43] and stacking box filters in parallel systems (stacked integrated image [44], [45]).A survey paper [46] compared accuracy and speed, including IIR filtering; however, this paper focused on a simple C on x86/64 CPUs, which is not optimal from the parallelization and vectorization aspects.core of the recursive processing of prefix sums, y i = x i + y i−1 , is not easy for parallelization; thus, there are various implementations of recursive filters on GPUs.First, massively parallel GPU implementation parallelizes the rows and columns without dependencies [47].However, this is not sufficiently parallel for massively parallel environments, and the data locality is low.Therefore, parallelization by tiling has been considered for a long time [10], and the GPU implementation of the recursive filter by Nehab et al. used the superposition property to reduce the dependence on tile computation and improve efficiency [13].This filter is only available in serial type.Other serial-type filters have been implemented in the Insight Toolkit (ITK) [9] with highly efficient CPU and GPU implementations [14].As an extension, there are also implementations of additive forms, such as Deriche's form [3], which are more highly parallel [15].These IIR filters require the consideration of infinite-length taps; however, there are efficient implementations [16].More efficient implementations extend to GPUs for 3D filtering [17].These implementations [13], [15], [16], [17] are summarized on the following page. 2

B. SHORT-TIME FOURIER AND SLIDING TRANSFORMS
The fast Fourier transform [48] is a classical acceleration technique, and various libraries provide its functions (FFTW [49], Intel MKL, cuFFT).Recently, the short-time Fourier transform (STFT) [50] has been used for local analysis.Libraries for STFT included MATLAB stft, PyTortch pythorch.stft,TensorFlow tf.signal.stft,librosa [51], and the large time-frequency analysis toolbox (LTFAT) [52].STFT is slower than FFT in usual calculations because STFT computes FFT redundant on a block-by-block basis.However, a sliding transform reduces the order and is called a sliding discrete Fourier transform (SDFT) or a sliding window discrete Fourier transform (SWDFT).The sliding transform efficiently computes frequencies using an incremental formula based on the relationship between adjacent frequency components in a short-time DFT/DCT/DST.SDFT is summarized in detail in the following papers [53], [54], [55].
Frequency transformations have many applications, including linear convolutions, correlations, and spectral analyses.Among these, SDCT is used for convolution, and specialized transforms for convolution have been proposed.Convolutions usually require a forward transform, a filter, and an inverse transform; short-time transforms also require the same chains.However, convolution-specific methods hide the inverse transform in the flow, and SDCT filters are one such method.Initially, SDCT filters were used in the context of integral images [56] but were refined to various types of SDCT filters: I [23], [24], [29], III [27], V [25], [26], and VII [28].
Regular FIR filters have several parallelization patterns [57] and can be implemented efficiently depending on the situation.However, there is still little research on the efficient hardware implementation of SDCT filters, which are FIR filters realized as recursive filters.This research is limited to CPU [21], [58].The 1-pass 2D filter [58] is highly cache-efficient with fewer synchronizations, although only multichannel data can vectorized.This is useful for the multichannel-friendly filter [32], [59] and constant-time stereo matching [60].An efficient vectorization method [21] using FMA has also proposed for pure SDCT filters.

C. DSL FOR IMAGE PROCESSING
In parallel computing, various computing patterns exist [61], such as map, stencil, reduction, and scan patterns.In addition, parallel computation and memory structures vary depending on computer architecture.Therefore, programmers must program carefully according to solving problems and use the appropriate language depending on the architecture (e.g., NVIDIA CUDA, AMD HIP, HSL, OpenCL, OpenMP, OpenACC, MPI, SYCL).
DSLs that optimize the data stream to compute these patterns efficiently have been proposed in various domains.Stencil computation is a representative field of parallel DSL [62], [63], [64] because stencil patterns are used in various scientific problems.Image processing is a popular field in DSLs.Image processing involves the most parallel patterns; however, most image-processing algorithms include map, stencil, and reduction patterns.
More specific DSLs exist in image processing for FPGAs.Darkroom [72] 7 is an early image-processing DSL for FPGA that was embedded in Terra to extract the maximum reuse of data.SPIRAL [73] 8 is a DSL limited to signal processing.The following research extends functionalities, such as Rigel [74], 9 RIPL [75], 10 SODA [76], 11 and Aetherling [77]. 12In addition, extensions of image-processing DSL to include an FPGA backend were proposed.PolyMage's FPGA expansion [78], HIPACC-FPGA [79], [80], 13 and HipaccVX [81] 14 have been proposed.Halide has been extended to include an FPGA backend, such as Halide-HLS [82]  15 Halide to FPGAs [83] and Hetero-Halide [84]. 16 Halide has also been extended to other computing units such as DSPs [85], push memory [86] and TensorCore [87].To support accelerators while considering computation scheduling functions, DSLs for directly generating native language code (e.g., C and C++) rather than extending the DSL are also emerging, such as Exo Language [88]. 17 Current DSLs are extended to more specialized and narrow domains (e.g., machine learning) and broader domains.For deep learning DSL, TVM [89], [90], 18 Tensor comprehensions [91]  19 and Ansor [92] have also been proposed.TVM shares many similarities with Halide modified from the Halide IR.DSLs for deep learning are reviewed in [93].For more general-purpose DSLs, a polyhedral model for loop parallelization was used for wider domain adaptation.A polyhedral compiler can be used for image processing but is not limited to image processing: PLUTO [94], 20  Polly [95], 21 PENCIL [96], 22 and TIRAMISU [97]. 23These DSLs can be used for Halide, and Distributed Halide [98] is using Polly for better scheduling.Further research has been proposed to convert binary files and old native code to DSL.Helium [99]  Another extension direction is to move away from the stencil calculations; one is the proposed method.SlidingConv was used for the scan pattern with the tiling strategy instead of using stencil patterns for the usual convolution.RecFiler [19] and its extension [20] are similar extension patterns.Most image-processing DSLs are not Turing complete, and all Halide computations are over regular grids.Thus, recursions must have bounded domains, and additional loop generation by scheduling, such as initialization in SlidingConv, was not allowed.Indigo [101]  26 and Opt [102]  27 support image processing in matrix representation for inverse image processing, which commonly uses matrix inverse and FFT.RandConv extends Halide's stencil computations to sparse stencil computations [103].

VII. CONCLUSION
We present a new DSL for image convolution based on a sliding DCT for high-performance computing on various architectures.The sliding DCT can accelerate convolution's fundamental image processing tools by converting FIR filtering to recursive filtering.However, recursive processing prevents efficient parallel image processing and complicates the code.We solved this problem by providing only a scheduling interface and hiding the complex high-efficiency parallel code generation within the DSL.In addition, our DSL supports optimal code generation with minimal code modification in various situations, adding pre/post-processing for filtering and changing architectures (x86/64 AVX, AVX-512 CPU, ARM CPU, and GPU).We show that our DSL works more efficiently than the de facto libraries of OpenCV and ITK and the conventional work of RecFilter [19] and gpufilter [15], [17] on various CPU and GPU architectures.We also show the description efficiency that our DSL has equivalent or better performance than hand-tuned CPUimplemented code with a 1/1900 code length.
output value at position x 20: Let f (x) ∈ I W (x ∈ S = {0, 1, . . ., W −1} ⊂ N) be the input signals, where I = [0.0: 255.0] ∈ R is the range domain, W is the size of the signal.Let g(n) ⊂ R 2R+1 be the kernel weight, where R ∈ N denotes the kernel radius.The output

PROGRAM 3 .
Pseudo RecFilter code for sliding transform convolution.

PROGRAM 4 .
Internal code for fixing loop splitting problem.

PROGRAM 5 .PROGRAM 6 .
Internal code for fixing no-starting-point problem.Internal code for fixing fixed-input image problem.

FIGURE 3 .
FIGURE 3. Computational time of Gaussian filtering for each σ and DCT type on x86/64 CPU.Speed and accuracy of Gaussian filtering for each approximation order and DCT type on x86/64 CPU (Intel Core i5-10400).

FIGURE 5 .
FIGURE 5. Trade-off between accuracy (PSNR) and speed for various methods on x86/64 CPU (Intel Core i5-10400) and GPU (NVIDIA GeForce RTX 3060).Left-top points indicate high performance.The image size is 512 × 512.The parameters are σ = 3.Note that the output of RecFilter for CPU is 15.26 ms with 54.90 dB, which is outside of (a).

PROGRAM 9 .
An example code for word counting.

FIGURE 9 .
FIGURE 9. Time and accuracy of Gaussian filtering for each approximation order with σ = 3 on Intel Core i5-10400.

FIGURE 10 .
FIGURE 10.Effect of auto radius setting of gaussian filtering for each approximation order.σ = 3.
24 can convert compiled x86 binary codes into Halide codes, and DEXTER[100] 25 can automatically translate image-processing libraries into Halide.

TABLE 3 .
Optimized delta functions for each DCT type.

TABLE 4 .
Major classes and scheduling method of Halide.

TABLE 7 .
Used codes for each experiment.

TABLE 8 .
The number of words and code lines.