Digital Filter Architecture Based on Modified Winograd Method F(2 × 2, 5 × 5) and Residue Number System

Improving the characteristics of digital signal processing devices is an important task in many practical problems. The paper proposes the architecture of a two-dimensional digital filter with a $5\times 5$ mask, in which calculations are performed according to the Winograd method in the Residue Number System (RNS) with moduli of a special type. Theoretical analysis and hardware simulation on a Field-Programmable Gate Array (FPGA) are presented. The results show that the fragment throughput of the device, measured in fragments per second (fr/s), is 29.6%–724.7% higher than that of state-of-the-art solutions. This is achieved by combining the Winograd method, which reduces the number of multiplications, with RNS arithmetic, which performs addition and multiplication on smaller operands in parallel. However, our experiments showed that the proposed method requires 2.54%–11.01% more Look-Up Tables (LUTs) and has 3.58%–19.83% higher power consumption compared to known analogues.


I. INTRODUCTION
Digital filters are widely used as components of complex digital signal processing and analysis systems. These systems are used in practical areas such as medicine [1], [2], [3], [4], geolocation [5], [6], video surveillance [7], product quality control in manufacturing [8], and many others. In these practical problems, performance plays a key role. Therefore, the development of high-speed digital signal processing devices is an important problem [9], [10].
Parallelization of operations is a common approach to increasing the performance of a device. However, in many cases this method leads to an increase in hardware resources [11]. One of the approaches to reducing hardware resources is the common subexpression elimination (CSE) technique, which minimizes the number of logical operators and reduces the logical depth [12], [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Manuel Rosa-Zurera.
Although calculating the filter coefficients from given parameters within the device makes the filter more versatile, it requires additional hardware resources [14]. Therefore, it is advisable to calculate the filter coefficients in advance and store them in the device memory. Moreover, when the form of the filter coefficients is known in advance, the device architecture can be optimized [15].
The main computational load during filtering is the repeated execution of the multiplication operation. It is therefore desirable to reduce the number of multiplications in order to increase performance. In [16], the Winograd filtering method was proposed, which reduces the number of multiplications in the filtering process at the cost of an increased number of additions. Another approach is parallel computation. The Residue Number System (RNS) is a non-positional number system in which numbers are represented as a set of remainders with respect to independent co-prime moduli, and arithmetic operations can be performed in parallel [17]. The authors of [18] propose a method of constructing digital filters in RNS to automate the device design process and provide an effective ratio of performance and energy efficiency. The article [19] presents a digital filter architecture based on the Winograd method and RNS for a 2 × 2 filter mask. Unfortunately, the case of a 2 × 2 mask considered in that article is rarely used in practice.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Using RNS in real applications faces the problem of implementing computationally complex operations, such as forward and inverse conversion between the positional representation and RNS, sign detection, comparison of numbers, and division. Despite these problematic operations, RNS makes it possible to increase the speed of calculations, as shown, for example, for convolutional neural networks in [20]. To take advantage of the non-positional nature of RNS, the Winograd method should be modified accordingly. In this paper, we propose a new approach to the design of digital filter devices based on RNS and a modified Winograd method. Our contribution is summarized in the following list:
• A new modified Winograd method is proposed to increase the performance of two-dimensional digital filters with a 5 × 5 mask.
• The Winograd method and a 3-modulus RNS with moduli of the special form $2^{\alpha}$ and $2^{\alpha}-1$ have been combined to increase the performance of the digital filter.
• The architecture of a digital filter device with a 5 × 5 mask has been developed based on the proposed modified Winograd method.
• The performance of the digital filter has been theoretically evaluated based on the unit-gate model [21]. The theoretical evaluation showed the performance advantage of the proposed architecture compared to known analogues.
• The results of our FPGA simulation show that the proposed filter architecture has a 29.6%–724.7% higher fragment throughput compared to analogues.
The proposed device is designed to filter a 2D signal. Depending on the filter mask being used, it can perform various functions, such as smoothing, noise removal (impulse, Gaussian), sharpening, and edge detection. The main target application of the proposed filter is the design of hardware accelerators for convolutional neural networks (CNNs), since the 5 × 5 mask is often used in CNN architectures [22], [23].
The rest of the paper is organized as follows. The second section presents the features of digital filtering in RNS. The third section describes the known Winograd method for two-dimensional filtering. The fourth section proposes a modification of the Winograd method using RNS with moduli $2^{\alpha}$ and $2^{\alpha}-1$ for the design of digital filters. The fifth section contains the results of theoretical analysis and simulation. An analysis of the research results is presented in the sixth section. The conclusions are presented in the seventh section.

II. APPLICATION OF RNS FOR DIGITAL FILTERING
Digital signal filtering is implemented by digital filters, which are usually divided into filters with a finite impulse response (FIR) and filters with an infinite impulse response (IIR). Based on a sequence of signal samples $X(n)$, a signal $Y(n)$ is formed at the output of the FIR filter, defined by the formula

$$Y(n) = \sum_{i=0}^{P-1} f_i X(n-i), \tag{1}$$

where $f_i$ are the filter coefficients and $P$ is the filter order. Figure 1 shows the FIR filter architecture. The device input receives a sequence of signal samples $X(n)$ and filter coefficients $f_i$, and the output is the signal $Y(n)$. The multiply-accumulate operation according to equation (1) is performed using multiply-accumulate (MAC) units, shown in Figure 2. The MAC device consists of a partial product generator (PPG) unit, which is formed from an array of AND gates [24], and a multi-operand adder (MOA).
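The multiply-accumulate structure of formula (1) can be sketched in software as follows. This is a minimal illustration rather than the device itself: it assumes that the filter has `len(f)` taps indexed from 0 and that samples before the start of the signal are zero; the function name `fir_filter` is hypothetical.

```python
def fir_filter(x, f):
    """Direct-form FIR: y[n] = sum_i f[i] * x[n - i], with x[m] = 0 for m < 0."""
    taps = len(f)
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator of one MAC unit
        for i in range(taps):
            if n - i >= 0:
                acc += f[i] * x[n - i]  # one multiply-accumulate step
        y.append(acc)
    return y


samples = [1, 2, 3, 4]
coeffs = [1, 1, 1]
print(fir_filter(samples, coeffs))  # [1, 3, 6, 9]
```

Every output sample costs one multiplication per tap, which is the load that the Winograd method and RNS arithmetic aim to reduce.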
MOA units can be implemented using a tree of various adders. In this paper we use carry-save adders (CSA) [24], which convert the addition of three numbers into the addition of two numbers. The outputs of the CSA tree are added using a Kogge-Stone parallel-prefix adder (KSA) [25].
The MOA device architecture is shown in Figure 3. A sequence of terms $\{P_i\}$, where $0 \le i \le n$, is fed to the input of the device, and the output is the sum $S$.
We use RNS as one of the ways to accelerate computations through parallelism. Any integer $0 \le A < P$ can be uniquely represented in RNS by its residues modulo the system moduli, $A = \{a_1, a_2, \ldots, a_n\}$, where $a_i = A \bmod p_i$ and $P$ is the RNS dynamic range, equal to the product of the pairwise coprime moduli $\{p_1, p_2, \ldots, p_n\}$.
Digital filtering according to formula (1) in RNS is performed in several stages (Figure 4). First, it is necessary to convert data from the positional number system (PNS) to RNS. Then, filtering is performed in parallel on several computational channels, which correspond to the RNS moduli. Next, the inverse conversion from RNS to PNS is performed.
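The three stages above (forward conversion, independent channel arithmetic, inverse conversion) can be illustrated with a short sketch. The moduli set {7, 8, 15}, i.e. $2^3-1$, $2^3$ and $2^4-1$, is chosen only for illustration and is not necessarily the set used in the proposed device; the inverse conversion here uses the Chinese remainder theorem, and all function names are hypothetical.

```python
from math import prod

MODULI = (7, 8, 15)   # illustrative pairwise coprime moduli: 2^3 - 1, 2^3, 2^4 - 1
RANGE = prod(MODULI)  # dynamic range P = 840


def to_rns(a):
    """Forward conversion PNS -> RNS."""
    return tuple(a % p for p in MODULI)


def from_rns(residues):
    """Inverse conversion RNS -> PNS by the Chinese remainder theorem."""
    total = 0
    for r, p in zip(residues, MODULI):
        partial = RANGE // p
        total += r * partial * pow(partial, -1, p)  # modular inverse (Python 3.8+)
    return total % RANGE


def sum_of_products_rns(samples, coeffs):
    """Sum of products computed independently in each residue channel."""
    channels = []
    for p in MODULI:
        acc = 0
        for s, c in zip(samples, coeffs):
            acc = (acc + (s % p) * (c % p)) % p
        channels.append(acc)
    return from_rns(channels)


print(to_rns(100))                                  # (2, 4, 10)
print(sum_of_products_rns([10, 20, 30], [1, 2, 3]))  # 140
```

The result is correct only while the true value stays below the dynamic range, which is why the choice of moduli discussed next matters.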
The type of RNS moduli affects the performance of calculations. Therefore, their choice is an important problem when designing application systems that use RNS arithmetic. On the one hand, the moduli set must provide a sufficient dynamic range of the system for the unambiguous representation of numbers in RNS. On the other hand, the moduli must be balanced in such a way that the execution time for each channel is approximately the same and no computing channel causes long idle time in the system. Finally, RNS moduli of the special form $2^{\alpha}$ and $2^{\alpha}-1$, $\alpha \in \mathbb{N}$, where $\mathbb{N}$ stands for the set of natural numbers, make it possible to avoid the resource-consuming operation of modulo division.
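Why moduli of this special form avoid division can be seen from a small sketch. Reduction modulo $2^{\alpha}$ only keeps the low $\alpha$ bits, and since $2^{\alpha} \equiv 1 \pmod{2^{\alpha}-1}$, reduction modulo $2^{\alpha}-1$ can fold the high bits back onto the low bits. The function names are hypothetical and the code models the arithmetic, not the hardware.

```python
def mod_pow2(x, alpha):
    """x mod 2^alpha for x >= 0: keep the alpha low bits."""
    return x & ((1 << alpha) - 1)


def mod_pow2_minus1(x, alpha):
    """x mod (2^alpha - 1) for x >= 0: fold alpha-bit chunks onto each other."""
    mask = (1 << alpha) - 1
    while x > mask:
        # 2^alpha is congruent to 1, so the high part is added to the low part
        x = (x & mask) + (x >> alpha)
    # the all-ones word is a second representation of zero
    return 0 if x == mask else x


print(mod_pow2(45, 4), mod_pow2_minus1(45, 4))  # 13 0
```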
We propose to use the modified Winograd method with calculations in RNS with moduli of the special form $2^{\alpha}$ and $2^{\alpha}-1$ to implement digital filtering.

III. WINOGRAD FILTERING METHOD F(2 × 2, 5 × 5)
One-dimensional filtering by the Winograd method can be represented in matrix form as

$$z = A^{T}\left[(Gw) \odot (B^{T}d)\right], \tag{2}$$

where the operator $\odot$ denotes element-wise multiplication, $A$, $G$ and $B$ are transformation matrices, $w$ is the one-dimensional filter mask, $d$ is the data vector, and $z$ is the filtering result [26]. The algorithm of one-dimensional filtering according to the Winograd method is usually denoted $F(n, k)$, where $n$ is the size of the vector $z$ and $k$ is the size of the filter mask $w$. Two-dimensional filtering by the Winograd method in matrix form is [26]

$$Z = A^{T}\left[(GWG^{T}) \odot (B^{T}DB)\right]A, \tag{3}$$

where $W$ is a two-dimensional filter mask, $D$ is a two-dimensional data array, and $Z$ is the two-dimensional array of the filtering result. The two-dimensional filtering algorithm according to the Winograd method is usually denoted $F(n \times n, k \times k)$.

Consider one-dimensional filtering by the Winograd method using the example of the case $F(2, 5)$. We represent the vectors $w$, $d$ and $z$ as polynomials $w(x)$, $d(x)$ and $z(x)$, so that the filtering can be represented as a product of polynomials. Let us introduce a polynomial $m(x)$ of degree 6 and represent $d(x)$ as the remainder modulo $m(x)$; the polynomial $m(x)$ of degree 6 can then be replaced with a polynomial of degree 5. The transformation matrix $A$ is composed of the coefficients of the remainders after division of $z(x)$ by $m^{(i)}(x)$. The transformation matrix $B$ is composed of the coefficients of the polynomials $M^{(i)}(x)$ and $m(x)$, together with coefficients computed using the extended Euclidean algorithm. The transformation matrix $G$ is composed of the coefficients of the division residues.

For two-dimensional filtering $F(2 \times 2, 5 \times 5)$, calculations are made according to formula (3). Next, the device architectures of two-dimensional filtering by the Winograd method $F(2 \times 2, 5 \times 5)$ with calculations in RNS are presented.

IV. THE FILTER ARCHITECTURE ACCORDING TO THE MODIFIED WINOGRAD METHOD F(2 × 2, 5 × 5) IN THE RESIDUE NUMBER SYSTEM
A new filtering method based on the Winograd method and RNS with moduli of the special form $2^{\alpha}$ and $2^{\alpha}-1$ is proposed to increase the performance of digital filtering.
Let us divide a two-dimensional signal into fragments $D$ of size $m \times m$, $m > k$. Each fragment is processed by a $k \times k$ filter $W$ using the Winograd method $F(n \times n, k \times k)$ with step $n$ in each dimension. In the case of $F(2 \times 2, 5 \times 5)$, the two-dimensional signal is divided into 6 × 6 fragments and the processing is performed with a step of 2; the result of filtering one fragment $D$ is a filtered 2 × 2 fragment $Z$. Figure 5a shows the filtering process of a 256 × 256 2D signal with a 5 × 5 filter mask using the Winograd method $F(2 \times 2, 5 \times 5)$.
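The fragmentation step can be sketched as a sliding window. The sketch below assumes that no padding is applied at the signal borders, which the text does not specify; under that assumption a 256 × 256 signal yields 126 × 126 overlapping 6 × 6 fragments. The generator name `fragments` is hypothetical.

```python
def fragments(signal, m=6, step=2):
    """Yield (row, col, fragment) for every m x m window taken with the given step."""
    rows, cols = len(signal), len(signal[0])
    for r in range(0, rows - m + 1, step):
        for c in range(0, cols - m + 1, step):
            yield r, c, [row[c:c + m] for row in signal[r:r + m]]


# Small 8 x 8 test signal whose element at (r, c) equals 8 * r + c.
signal = [[8 * r + c for c in range(8)] for r in range(8)]
tiles = list(fragments(signal))
print(len(tiles))  # 4 fragments, at offsets (0, 0), (0, 2), (2, 0), (2, 2)
```

Each fragment produces one 2 × 2 block of the output, so neighbouring fragments overlap by four rows or columns.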
Filtering with a $k \times k$ mask in the traditional way requires $k^{2}$ multiplications per output element, so $n^{2}k^{2}$ multiplications are required to form an $n \times n$ filtered fragment. The Winograd method $F(n \times n, k \times k)$ requires $(n + k - 1)^{2}$ multiplications [26]. Thus, to form a 2 × 2 filtered fragment from a 6 × 6 signal fragment with a 5 × 5 filter mask, the traditional approach requires $n^{2}k^{2} = 100$ multiplication operations. The Winograd method $F(2 \times 2, 5 \times 5)$ reduces the number of multiplications to $(n + k - 1)^{2} = 36$, that is, by a factor of approximately 2.78.
The two-dimensional filtering procedure according to the Winograd method, described by formula (3), processes the signal in several stages. Let us denote the result of the filter mask transformation as $U = GWG^{T}$. Since the filter coefficients are constants, this transformation can be performed once in advance, which means it does not carry a computational load. Let us denote the result of the input data transformation as $V = B^{T}DB$ and the result of the element-wise matrix multiplication as $M = U \odot V$. Then, in this notation, formula (3) becomes $Z = A^{T}MA$. Figure 5b shows the filtering process of a signal fragment $D$ according to the Winograd method using the case $F(2 \times 2, 5 \times 5)$ as an example.
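The four stages $U$, $V$, $M$ and $Z$ can be traced in a short sketch. Because the transformation matrices for $F(2 \times 2, 5 \times 5)$ are not reproduced here, the sketch substitutes the well-known matrices of the smaller case $F(2 \times 2, 3 \times 3)$ from [26]; only the sizes differ (4 × 4 fragments and 16 multiplications instead of 6 × 6 and 36). The helper names are hypothetical, and the result is checked against direct filtering.

```python
from fractions import Fraction as Fr

# Matrices of the smaller case F(2, 3) from [26], used in place of the F(2, 5) ones.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0],
     [Fr(1, 2), Fr(1, 2), Fr(1, 2)],
     [Fr(1, 2), Fr(-1, 2), Fr(1, 2)],
     [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def winograd_2x2_3x3(d, w):
    u = matmul(matmul(G, w), transpose(G))       # U = G W G^T, can be precomputed
    v = matmul(matmul(BT, d), transpose(BT))     # V = B^T D B
    m = [[ue * ve for ue, ve in zip(ur, vr)]     # M = U (element-wise) V
         for ur, vr in zip(u, v)]
    return matmul(matmul(AT, m), transpose(AT))  # Z = A^T M A


def direct(d, w):
    """Reference: sliding-window sum of products without any transformation."""
    k, n = len(w), len(d) - len(w) + 1
    return [[sum(w[a][b] * d[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(n)] for i in range(n)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
w = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(winograd_2x2_3x3(d, w) == direct(d, w))  # True
```

Only the element-wise stage contains general multiplications; the other stages multiply by fixed constants, which the hardware described below implements with shifts and additions.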
The addition of several numbers modulo $2^{\alpha}$ and $2^{\alpha}-1$ is proposed to be performed using multi-operand modulo adders, denoted $\mathrm{MOMA}_{2^{\alpha}}$ and $\mathrm{MOMA}_{2^{\alpha}-1}$ respectively (Figure 6). These devices consist of a CSA tree and a KSA. The vector $P = \{P_0, P_1, \ldots, P_{\beta}\}$ arrives at the input of the devices, and the sum $S$ is formed at the output. For calculations modulo $2^{\alpha}-1$, the End-Around-Carry (EAC) technique is used [27].
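A behavioural sketch of multi-operand addition modulo $2^{\alpha}-1$ with carry-save steps and end-around carry is given below. It is an arithmetic model under simplifying assumptions, not the circuit of Figure 6: operands are assumed to be $\alpha$-bit words, the 3:2 steps are applied in a simple queue order rather than as a balanced tree, and an ordinary addition stands in for the KSA. The function names are hypothetical.

```python
def csa_eac(a, b, c, alpha):
    """One 3:2 carry-save step on alpha-bit words modulo 2^alpha - 1."""
    mask = (1 << alpha) - 1
    s = a ^ b ^ c                        # per-bit sum without carry propagation
    carry = (a & b) | (a & c) | (b & c)  # carry generated at every bit position
    # End-around carry: the carry leaving the top bit re-enters at bit 0.
    carry = ((carry << 1) | (carry >> (alpha - 1))) & mask
    return s, carry


def moma_pow2_minus1(operands, alpha):
    """Sum of alpha-bit operands modulo 2^alpha - 1."""
    mask = (1 << alpha) - 1
    terms = list(operands)
    while len(terms) > 2:
        s, c = csa_eac(terms[0], terms[1], terms[2], alpha)
        terms = terms[3:] + [s, c]
    total = sum(terms)                   # final two-operand addition (stand-in for the KSA)
    while total > mask:
        total = (total & mask) + (total >> alpha)  # end-around carry of the final adder
    return 0 if total == mask else total


print(moma_pow2_minus1([13, 7, 9, 14, 3], 4))  # 46 mod 15 = 1
```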
Data transformations modulo $2^{\alpha}$ are performed using the data transform element devices $\mathrm{DTE}_{2^{\alpha}}$, shown in Figure 7a. The device input is the vector $\{P_i\}$, where $0 \le i < l$. Since negative numbers modulo $2^{\alpha}$ are represented in two's complement code, a correction constant $C$ is introduced, equal to the number of negative elements of the vector $\{P_i\}$. SL (shift left) blocks perform a left shift by $n$ bits, which corresponds to a multiplication by $2^{n}$. Next, addition is performed using the CSA adder tree. Data transformation modulo $2^{\alpha}-1$ is performed using the device $\mathrm{DTE}_{2^{\alpha}-1}$ (Figure 7b), which differs in that the EAC technique of cyclically carrying the high bits around is used and SLA (shift left around) devices perform a cyclic shift by $n$ bits. Since negative numbers modulo $2^{\alpha}-1$ are represented in one's complement code, adding a correction constant is not required.
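The role of the correction constant and of the two kinds of shift can be checked numerically. The sketch below is a behavioural model with hypothetical function names: negative terms modulo $2^{\alpha}$ are inverted bitwise and the count of such terms is added once, whereas modulo $2^{\alpha}-1$ bitwise inversion alone gives the negated value.

```python
def signed_sum_pow2(terms, negative, alpha):
    """Sum of +/- terms modulo 2^alpha using inversion plus a correction constant."""
    mask = (1 << alpha) - 1
    acc, correction = 0, 0
    for t, is_neg in zip(terms, negative):
        if is_neg:
            acc += t ^ mask   # bitwise inversion of the term
            correction += 1   # two's complement needs +1 for every negated term
        else:
            acc += t
    return (acc + correction) & mask


def neg_pow2_minus1(x, alpha):
    """-x modulo 2^alpha - 1: bitwise inversion only (one's complement)."""
    return x ^ ((1 << alpha) - 1)


def shift_left(x, n, alpha):
    """SL block: multiplication by 2^n modulo 2^alpha."""
    return (x << n) & ((1 << alpha) - 1)


def shift_left_around(x, n, alpha):
    """SLA block: multiplication by 2^n modulo 2^alpha - 1 as a cyclic shift."""
    mask = (1 << alpha) - 1
    n %= alpha
    return ((x << n) | (x >> (alpha - n))) & mask


print(signed_sum_pow2([5, 3, 7], [False, True, True], 4))  # (5 - 3 - 7) mod 16 = 11
print(shift_left_around(9, 1, 4))                          # (9 * 2) mod 15 = 3
```

Note that modulo $2^{\alpha}-1$ the all-ones word and the all-zeros word both represent zero, so results should be compared modulo $2^{\alpha}-1$.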
The elements of one row of the matrix $V$ modulo $2^{\alpha}$ are calculated by the $\mathrm{DTR}_{2^{\alpha}}$ device (Figure 8). This device performs the data transformation of the matrix $D$ and generates the elements $V_{i,j}$, $0 \le i \le 5$, $0 \le j \le 5$. To calculate the elements of the $i$-th row of the matrix $V$ modulo $2^{\alpha}$, the input of $\mathrm{DTR}_{2^{\alpha}}$ is supplied with the data vector $D_i$. The input data goes to the $\mathrm{DTE}_{2^{\alpha}}$ data transformation devices, and the results are added using $\mathrm{MOMA}_{2^{\alpha}}$ adders. Thus, the data transformation device modulo $2^{\alpha}$ (let us denote it as $\mathrm{DT}_{2^{\alpha}}$) consists of 6 $\mathrm{DTR}_{2^{\alpha}}$ devices.
Calculation of the elements of the matrix $V$ modulo $2^{\alpha}-1$ uses the one's complement representation of negative numbers, that is, bitwise inversion of the number; therefore, no correction constants are involved in the calculations. Accordingly, the device for data transformation modulo $2^{\alpha}-1$ (let us denote it as $\mathrm{DT}_{2^{\alpha}-1}$) consists of 6 $\mathrm{DTR}_{2^{\alpha}-1}$ devices (shown in Figure 8), the inputs of which are the vectors $D_i$ and $N_i$.
Element-wise multiplication of the matrices $U$ and $V$ is performed using the devices $\mathrm{EWM}_{2^{\alpha}}$ and $\mathrm{EWM}_{2^{\alpha}-1}$, which consist of 36 parallel two-operand multipliers $\mathrm{MUL}_{2^{\alpha}}$ and $\mathrm{MUL}_{2^{\alpha}-1}$ respectively and are shown in Figure 10. Elements of the matrices $U$ and $V$ are supplied to the input of the device. The multiplier $\mathrm{MUL}_{2^{\alpha}}$ consists of a partial product generator modulo $2^{\alpha}$, $\mathrm{PPG}_{2^{\alpha}}$, which is formed from an array of AND gates [21], and $\mathrm{MOMA}_{2^{\alpha}}$. The $\mathrm{MUL}_{2^{\alpha}-1}$ device consists of a partial product generator modulo $2^{\alpha}-1$, $\mathrm{PPG}_{2^{\alpha}-1}$, which uses the EAC technique, and $\mathrm{MOMA}_{2^{\alpha}-1}$. Thus, a 6 × 6 matrix $M$ is formed.
The elements of one row of the matrix $Z$ modulo $2^{\alpha}$ are calculated by the $\mathrm{FTR}_{2^{\alpha}}$ device (Figure 11). This device performs the final data transformation of the matrix $M$ and generates the elements $Z_{i,j}$, $0 \le i \le 1$, $0 \le j \le 1$. To calculate the elements of the $i$-th row of the matrix $Z$ modulo $2^{\alpha}$, the input of $\mathrm{FTR}_{2^{\alpha}}$ is supplied with the data vector $R_i = \{R_{i,0}, R_{i,1}, \ldots, R_{i,29}\}$, the correction coefficients $C_i = \{C_{i,0}, C_{i,1}, C_{i,2}\}$ and the offset vector $N_i$. The input data goes to the $\mathrm{DTE}_{2^{\alpha}}$ data transformation devices, and the results are added using $\mathrm{MOMA}_{2^{\alpha}}$ adders. Thus, the device for the final data transformation modulo $2^{\alpha}$ (let us denote it as $\mathrm{FT}_{2^{\alpha}}$) consists of 2 $\mathrm{FTR}_{2^{\alpha}}$ devices.
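The structure of the modular multipliers described above (partial product generation followed by multi-operand modular addition) can be modelled as follows. This is a behavioural sketch with hypothetical function names: modulo $2^{\alpha}$ the partial products are shifted copies of the multiplicand with the overflowing bits dropped, and modulo $2^{\alpha}-1$ they are cyclically rotated copies; ordinary Python addition stands in for the MOMA units.

```python
def mul_pow2(a, b, alpha):
    """a * b modulo 2^alpha from shifted, truncated partial products."""
    mask = (1 << alpha) - 1
    total = 0
    for i in range(alpha):
        if (b >> i) & 1:                # AND-gate row selected by bit i of b
            total += (a << i) & mask    # bits above position alpha - 1 are dropped
    return total & mask


def mul_pow2_minus1(a, b, alpha):
    """a * b modulo 2^alpha - 1 from cyclically rotated partial products."""
    mask = (1 << alpha) - 1
    total = 0
    for i in range(alpha):
        if (b >> i) & 1:
            # rotation by i bits equals multiplication by 2^i modulo 2^alpha - 1
            total += ((a << i) | (a >> (alpha - i))) & mask
    while total > mask:
        total = (total & mask) + (total >> alpha)  # end-around carry
    return 0 if total == mask else total


print(mul_pow2(11, 13, 4), mul_pow2_minus1(11, 13, 4))  # 143 mod 16 = 15, 143 mod 15 = 8
```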
Calculation of the elements of the matrix $Z$ modulo $2^{\alpha}-1$ uses the one's complement representation of negative numbers; therefore, no correction constants are involved in the calculations. Accordingly, the device for the final data transformation modulo $2^{\alpha}-1$ (let us denote it as $\mathrm{FT}_{2^{\alpha}-1}$) consists of 2 $\mathrm{FTR}_{2^{\alpha}-1}$ devices (shown in Figure 8), the inputs of which are the vectors $R_i$ and $N_i$.
Figure 13 shows the proposed filtering device $F(2 \times 2, 5 \times 5)_{2^{\alpha}}$ modulo $2^{\alpha}$. The filter mask transformation $U = GWG^{T}$ is performed in advance, and the result is stored in the device memory. Since operations with negative numbers require their representation in two's complement code, the correction constants are also stored in the device memory. Figure 14 shows a circuit of the proposed filtering device $F(2 \times 2, 5 \times 5)_{2^{\alpha}-1}$ modulo $2^{\alpha}-1$. Only the transformed filter mask is stored in the memory of this device.
The parameters of a device based on TMAC blocks in RNS that performs calculations modulo $2^{\alpha}$ are calculated by (30), and for devices modulo $2^{\alpha}-1$ as follows [28]: U delay FIR (TMAC) 2 α −1 = 6, 8Plog 2 α
Table 2 presents the results of the theoretical evaluation of the area and delay parameters of the proposed and known filters based on the ''unit-gate'' model. The processing time for a frame of size 256 × 256 was also estimated.
Hardware simulation on FPGA was carried out in the Xilinx Vivado 2018.3 CAD environment for the Virtex UltraScale xcvu440-flgb2377-3-e target board with the Flow_PerfOptimized_high optimization strategy. The proposed architecture is not tied to a specific board and can be synthesized for other target devices. Calculations are performed in fixed-point format. 8-, 16-, and 32-bit filters were considered. For devices with calculations in RNS with moduli of a special type, the bit width of each computing channel corresponds to the exponent of its modulus.
The hardware simulation results are presented in Table 3. The devices were evaluated using parameters such as clock frequency, number of LUTs, power consumption, and performance, which were obtained from simulation in the design environment. Device fragment throughput refers to the number of 256 × 256 pixel frames processed per second.

VI. DISCUSSION
The theoretical analysis of the parameters of the proposed and known two-dimensional filters with a 5 × 5 mask showed that the proposed approach, based on the Winograd method and RNS with moduli of a special type, reduces the device delay by 15.3%–81.3% and the signal processing time by 15.3%–95.3% compared with known approaches. In addition, the device based on the proposed method has a 9.66%–46.76% smaller area than the device based on the Winograd method [29]. Nevertheless, the proposed approach increases the area of the device by 2.7%–437% in comparison with the other known methods considered.
The results of hardware simulation showed that the proposed method of constructing filters based on the Winograd method and RNS increases the clock frequency of 16-bit and 32-bit devices by 29.63% and 38.24%, respectively, compared to the filter based on the Winograd method [29] without RNS arithmetic. For 8-bit devices, however, the clock frequency of the filter based on the proposed method is 3.23% lower. In addition, the proposed method increases the clock frequency of the device by 7.14%–105.88% compared to methods based on FIR filters with MAC and TMAC blocks [28], [30], [31].
FIGURE 11. $\mathrm{FTR}_{2^{\alpha}}$ device for calculating the elements of the i-th row of the matrix Z modulo $2^{\alpha}$.
The combined use of RNS and the Winograd method reduces the number of LUTs by 9.50%–28.17% and the energy consumption by 0.49%–4.14% compared with the Winograd method in the PNS. However, devices designed according to the proposed method use 2.54%–11.01% more LUTs and have 3.58%–19.83% higher power consumption compared to devices based on the methods of [28], [30], [31].
Hardware simulation showed that the proposed method based on the Winograd method and RNS increases the filter fragment throughput by 29.6%–724.7% compared to filters based on the known methods considered. However, the 8-bit device based on the Winograd method and PNS outperforms the device based on the proposed method by 3.22%. The slight difference between the results of theoretical analysis and the results of hardware simulation is explained by a peculiarity of the ''unit-gate'' model, which does not consider the load capacity of the device output, the memory involved, or the time needed to access it. As the experimental results showed, the proposed filter architecture can be applied in digital signal processing systems where high performance is required. In systems with limited hardware resources, it is better to use the filter architecture proposed in [28], although this leads to a decrease in performance.
The proposed filter architectures can be applied to digital filters for edge detection [32], [33] and smoothing [34], discrete wavelet transform [35], and to implement the convolution operation in the convolutional layer of the convolutional neural network [36].

VII. CONCLUSION
The paper proposes a digital filter architecture with a 5 × 5 mask based on the modified Winograd method using RNS with moduli of the special type $2^{\alpha}$ and $2^{\alpha}-1$. A theoretical analysis and a hardware implementation on FPGA were performed.
Comparison with known digital filter architectures shows that the proposed method makes it possible to:
• increase the clock frequency of 16-bit and 32-bit devices by 29.63%–38.24% compared to the filter based on the Winograd method without RNS [29], and the clock frequency of the device by 7.14%–105.88% compared to methods based on FIR filters with MAC and TMAC blocks [28], [30], [31];
• reduce the number of occupied LUTs by 9.50%–28.17% and power consumption by 0.49%–4.14% compared with the Winograd method without RNS;
• increase filter fragment throughput (fr/s) by 29.6%–724.7% compared to filters based on the known methods.
The research results can be efficiently used in the design of digital signal processing systems, for example, neural networks, machine vision, and many others.

ANDREI TCHERNYKH (Member, IEEE) received the Ph.D. degree from the Institute of Precise Mechanics and Computer Technology, Russian Academy of Sciences (RAS), Russia, in 1986. He is currently a Full Professor with the CICESE Research Center, Computer Science Department, Ensenada, Baja California, Mexico, and an Adjunct Professor with the Institute for System Programming, RAS, Russia. He is also the Head of the Parallel Computing Laboratory, CICESE, and the Laboratory of Problem-Oriented Cloud Computing, South Ural State University, Russia. His main research interests include resource optimization techniques, adaptive resource provisioning, multiobjective optimization, computational intelligence, incomplete information processing, cloud computing, and security. He is a member of the National System of Researchers of Mexico (SNI), Level II, and leads several national and international research projects.