Digital Filter Architecture with Calculations in the Residue Number System by Winograd Method F(2 × 2, 2 × 2)

Improving the technical characteristics of digital signal processing devices is an important problem in many practical tasks. This paper proposes a device architecture for two-dimensional filtering by the Winograd method in a residue number system (RNS) with moduli of a special type. A theoretical analysis of the technical parameters of the proposed filter architecture is carried out for different RNS moduli sets using the "unit-gate" model. In addition, the proposed architecture is compared with known digital filter implementations. The theoretical analysis showed that the proposed filter architecture increases the signal processing speed by 1.33 – 6.90 times compared with known device implementations. Hardware simulation of the proposed filter architecture on FPGA showed that the performance of the proposed device is 1.31 – 4.12 times higher than that of known digital filter architectures. The research results can be used in digital signal processing systems to increase their performance and reduce hardware costs. In addition, the developed architectures can be applied in the development of hardware accelerators for complex digital signal analysis systems.


I. INTRODUCTION
Digital signal filtering is widely applied in various areas such as medicine [1,2], geolocation [3], video surveillance systems [4], quality control in production [5], and many others. Performance plays a central role in these practical tasks. Hardware implementation of digital filtering allows increasing the speed of signal processing systems [6]. Therefore, improving digital filter technical characteristics is a significant challenge.
The main computational load during filtering comes from the repeated execution of the multiplication operation. One approach to increasing the speed of a digital filter is to reduce the number of multiplications. The paper [7] proposes the Winograd filtering method, which reduces the number of multiplications in the filtering process at the cost of a larger number of additions. The authors of [8] presented a software implementation of the Winograd method and applied it in a convolutional layer of a neural network with calculations on a graphics processor. In [9], the authors developed a hardware accelerator on a Field-Programmable Gate Array (FPGA) based on the Winograd method for the convolutional layer of a neural network.
Another approach to increasing the speed of devices is parallel computation. The residue number system (RNS) is a non-positional number system that represents numbers as small residues, and arithmetic operations are performed in parallel for each modulus [10]. The authors of [11] propose a method for constructing digital filters in RNS to automate the device design process and provide an effective balance of speed and energy efficiency. In [12], a new architecture for multiply-accumulate (MAC) units, which are the basis of digital filters, is proposed. The proposed architecture is based on ternary-valued logic and RNS. However, this approach leads to high complexity of conversion between ternary-valued logic and RNS. The authors of [13] proposed a finite impulse response filter architecture based on truncated MAC (TMAC) units. In [14], TMAC units are implemented in RNS with moduli of the special type $2^{\alpha}$ and $2^{\alpha} - 1$, $\alpha \in \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers. Moduli of this special type reduce the calculation of the remainder of a division to a bit shift operation (for modulo $2^{\alpha}$) and an addition of $\alpha$-bit numbers (for modulo $2^{\alpha} - 1$) [15], and allow the use of efficient addition and multiplication techniques [16,17].
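To illustrate why moduli of this special type are convenient, the following Python sketch reduces a non-negative integer modulo $2^{\alpha}$ by keeping its low bits and modulo $2^{\alpha} - 1$ by adding $\alpha$-bit chunks. The function names are hypothetical and the code is a software illustration under these assumptions, not a description of the hardware in [14,15].

```python
def mod_pow2(x: int, alpha: int) -> int:
    """Reduce a non-negative x modulo 2**alpha by keeping the alpha low bits."""
    return x & ((1 << alpha) - 1)


def mod_pow2_minus1(x: int, alpha: int) -> int:
    """Reduce a non-negative x modulo 2**alpha - 1 by adding alpha-bit chunks."""
    mask = (1 << alpha) - 1
    while x > mask:
        # 2**alpha is congruent to 1, so the high part is added to the low part.
        x = (x & mask) + (x >> alpha)
    # The all-ones pattern is a second representation of zero.
    return 0 if x == mask else x


if __name__ == "__main__":
    print(mod_pow2(1000, 8), mod_pow2_minus1(1000, 8))
```

Neither function performs a division: the first is a truncation, and the second needs only shifts and additions of $\alpha$-bit numbers.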
In this work, we propose a device architecture for two-dimensional filtering by the Winograd method for a 2 × 2 filter mask using RNS with moduli of the special type $2^{\alpha}$ and $2^{\alpha} - 1$. In the experimental part of the paper, a theoretical analysis of the delay and area parameters of the proposed devices is carried out. The theoretical results are confirmed by hardware simulation on FPGA.
The rest of the article is organized as follows. In the second section, features of digital filtering in RNS are presented. In the third section, the Winograd method for two-dimensional signal filtering is described. A new device architecture for filtering by the Winograd method using calculations in RNS with moduli $2^{\alpha}$ and $2^{\alpha} - 1$ is presented in the fourth section. The fifth section contains the theoretical analysis results, hardware simulation on FPGA, and a comparison with known digital filter architectures. The research results are analyzed and discussed in the sixth section. The conclusions are presented in the seventh section.

II. DIGITAL FILTERING IN RESIDUE NUMBER SYSTEM
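Digital filtering is applied for digital signal processing. When a one-dimensional signal consisting of $N$ samples is processed by a filter $W$ of size $r$, filtering is described by the following formula [18]:
$$Y(i) = \sum_{k=0}^{r-1} W(k)\, X(i+k), \tag{1}$$
where $X$ is the signal being processed and $0 \le i < N$. When a two-dimensional signal consisting of $R \times C$ samples is processed by an $r \times r$ filter $W$, the filtering has the form
$$Y(i,j) = \sum_{k=0}^{r-1} \sum_{l=0}^{r-1} W(k,l)\, X(i+k,\, j+l), \tag{2}$$
where $0 \le i < R$, $0 \le j < C$. As seen from (1) and (2), signal filtering consists of addition and multiplication operations, and the main computational load comes from the repeated multiplications. One way to increase the performance of digital filtering devices is to perform the computations in RNS. In RNS, numbers are represented by their residues with respect to pairwise coprime numbers called moduli, $\{m_1, \ldots, m_s\}$, $\gcd(m_p, m_q) = 1$ for $p \ne q$. The product of all RNS moduli, $M = \prod_{q=1}^{s} m_q$, is called the dynamic range of the system. Any integer $0 \le A < M$ is uniquely represented in RNS as a vector $\{a_1, a_2, \ldots, a_s\}$, where $a_q = |A|_{m_q}$ is the remainder of dividing $A$ by the modulus $m_q$ [10]. Filtering is then performed in parallel for each modulus. Since addition, subtraction, and multiplication in RNS are determined by the formulas
$$|A \pm B|_{m_q} = \big|\, |A|_{m_q} \pm |B|_{m_q} \big|_{m_q}, \qquad |A \cdot B|_{m_q} = \big|\, |A|_{m_q} \cdot |B|_{m_q} \big|_{m_q}, \tag{3}$$
the one-dimensional filtering represented by expression (1) has the following form modulo $m_q$ [18]:
$$|Y(i)|_{m_q} = \left| \sum_{k=0}^{r-1} |W(k)|_{m_q} \cdot |X(i+k)|_{m_q} \right|_{m_q}. \tag{4}$$
Similarly to (4), for the two-dimensional case (2), signal processing modulo $m_q$ is described as follows:
$$|Y(i,j)|_{m_q} = \left| \sum_{k=0}^{r-1} \sum_{l=0}^{r-1} |W(k,l)|_{m_q} \cdot |X(i+k,\, j+l)|_{m_q} \right|_{m_q}. \tag{5}$$
The last stage performs the reverse conversion from RNS to the positional number system (PNS). Reconstruction of the number from the residues $\{a_1, a_2, \ldots, a_s\}$ is based on the Chinese remainder theorem [19]:
$$A = \left| \sum_{q=1}^{s} M_q \left| M_q^{-1} \right|_{m_q} a_q \right|_{M}, \tag{6}$$
where $M_q = M / m_q$. The term $\left| M_q^{-1} \right|_{m_q}$ denotes the multiplicative inverse of $M_q$ modulo $m_q$.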
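The chain of forward conversion, independent filtering in each modular channel, and reconstruction by the Chinese remainder theorem can be sketched in Python as follows. The moduli set {16, 15, 7}, the function names, and the sliding-window form of the one-dimensional filter (each output is the sum of mask coefficients multiplied by consecutive samples) are assumptions made for illustration only.

```python
from math import prod

# Hypothetical moduli set 2**4, 2**4 - 1, 2**3 - 1 (pairwise coprime).
MODULI = (16, 15, 7)


def to_rns(value, moduli=MODULI):
    """Forward conversion: one residue per modulus."""
    return tuple(value % m for m in moduli)


def from_rns(residues, moduli=MODULI):
    """Reverse conversion by the Chinese remainder theorem."""
    big_m = prod(moduli)
    total = 0
    for a_q, m_q in zip(residues, moduli):
        m_big_q = big_m // m_q
        # pow(x, -1, m) returns the multiplicative inverse of x modulo m.
        total += m_big_q * pow(m_big_q, -1, m_q) * a_q
    return total % big_m


def filter_1d_rns(signal, mask, moduli=MODULI):
    """Filter in each modular channel separately, then reconstruct."""
    out = []
    for i in range(len(signal) - len(mask) + 1):
        residues = tuple(
            sum((w % m) * (signal[i + k] % m) for k, w in enumerate(mask)) % m
            for m in moduli
        )
        out.append(from_rns(residues, moduli))
    return out


if __name__ == "__main__":
    print(filter_1d_rns([1, 2, 3, 4], [5, 6]))
```

The reconstruction is exact only while every filter output stays below the dynamic range, which is 16 · 15 · 7 = 1680 for this example set.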
The type of the RNS moduli affects the technical characteristics of the device, such as performance, hardware costs, and power consumption. Moduli of the special type $2^{\alpha}$ and $2^{\alpha} - 1$, $\alpha \in \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers, avoid the division operation, which requires extensive computational resources [17].
In this paper, we propose the architecture of a two-dimensional filter with calculations by the Winograd method in RNS with moduli of the special type $2^{\alpha}$ and $2^{\alpha} - 1$. The operations of conversion to RNS and reverse conversion to PNS are not considered in this study.

III. COMPUTATIONS IMPLEMENTATION IN DIGITAL FILTERS BY WINOGRAD METHOD
One-dimensional filtering by the Winograd method in matrix terms has the form
$$Y = A^{T}\big[(G W) \odot (B^{T} X)\big], \tag{7}$$
where the operator $\odot$ denotes elementwise matrix multiplication, $A^{T}$, $G$, and $B^{T}$ are transformation matrices, $W$ is a one-dimensional filter mask, $X$ is a data vector, and $Y$ is the filtering result [8]. The one-dimensional filtering algorithm by the Winograd method is usually denoted $F(n, r)$, where $n$ is the size of the vector $Y$ and $r$ is the filter mask size. Two-dimensional filtering by the Winograd method in matrix terms has the form [8]
$$Y = A^{T}\big[(G W G^{T}) \odot (B^{T} X B)\big] A, \tag{8}$$
where $W$, $X$, and $Y$ are two-dimensional matrices. The two-dimensional filtering algorithm by the Winograd method is usually denoted $F(n \times n, r \times r)$.
Consider one-dimensional filtering by the Winograd method using the example F(2, 2) [8]. The vectors $W$, $X$, and $Y$ are represented as polynomials, so that the filtering is represented as a product of polynomials. The polynomial $Y(p)$ is then represented as the remainder of division by a polynomial $m(p)$ of the fourth degree, obtained with the modulo polynomial division operator. If $m(p)$ of the fourth degree is replaced by a polynomial of the third degree, the result is expressed through $R_{m(p)}[Y(p)]$, the remainder of $Y(p)$ divided by $m(p)$.
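As a concrete illustration of F(2, 2), the sketch below computes two outputs of a two-tap filter over three samples with three multiplications instead of four. The particular combination of terms is one commonly used variant and is an assumption here, as are the function names; it is checked against direct computation rather than taken from the derivation above.

```python
def direct_f22(x, w):
    """Two filter outputs computed directly (four multiplications)."""
    return [x[0] * w[0] + x[1] * w[1],
            x[1] * w[0] + x[2] * w[1]]


def winograd_f22(x, w):
    """The same two outputs computed with three multiplications."""
    x0, x1, x2 = x
    w0, w1 = w
    m1 = (x0 - x1) * w0
    m2 = x1 * (w0 + w1)   # w0 + w1 depends only on the mask and can be precomputed
    m3 = (x2 - x1) * w1
    return [m1 + m2, m2 + m3]


if __name__ == "__main__":
    print(direct_f22((1, 2, 3), (4, 5)), winograd_f22((1, 2, 3), (4, 5)))
```

The saved multiplication is paid for with two extra subtractions on the data side and two extra additions on the output side, which matches the trade-off described for the Winograd method.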

IV. THE PROPOSED FILTER ARCHITECTURE WITH CALCULATIONS BY WINOGRAD METHOD IN THE RESIDUE NUMBER SYSTEM
We divide the two-dimensional signal into fragments of size $h \times h$, $h > r$ (for the Winograd method, $h = n + r - 1$). Each fragment is processed by a filter of dimension $r \times r$ according to the Winograd method $F(n \times n, r \times r)$ with a step $n$ in each dimension. In the case of F(2 × 2, 2 × 2), the two-dimensional signal is divided into 3 × 3 fragments, and the processing is performed with step 2. Fig. 2 shows the filtering process scheme for a fragment according to the Winograd method F(2 × 2, 2 × 2). The two-dimensional filtering by the Winograd method described by (8) performs signal processing in several stages. The filter mask transformation result is denoted as $U = G W G^{T}$.
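The fragment-wise processing can be sketched in Python as follows. The transformation matrices `BT`, `G`, and `AT` are one standard choice for F(2, 2) and are an assumption here, as are all function names; the sketch uses ordinary integers rather than RNS residues and handles only signals with odd side lengths.

```python
BT = [[1, -1, 0], [0, 1, 0], [0, -1, 1]]  # data transformation B^T (assumed)
G = [[1, 0], [1, 1], [0, 1]]              # mask transformation G (assumed)
AT = [[1, 1, 0], [0, 1, 1]]               # final transformation A^T (assumed)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def winograd_f2x2_2x2(x, w):
    """3x3 fragment x, 2x2 mask w -> 2x2 result with nine multiplications."""
    u = matmul(matmul(G, w), transpose(G))      # U = G W G^T
    v = matmul(matmul(BT, x), transpose(BT))    # V = B^T X B
    p = [[u[i][j] * v[i][j] for j in range(3)] for i in range(3)]
    return matmul(matmul(AT, p), transpose(AT))  # Y = A^T (U * V) A


def filter_2d_tiled(signal, w):
    """Process a signal with odd side lengths by 3x3 fragments with step 2."""
    rows, cols = len(signal) - 1, len(signal[0]) - 1
    out = [[0] * cols for _ in range(rows)]
    for i in range(0, rows, 2):
        for j in range(0, cols, 2):
            frag = [row[j:j + 3] for row in signal[i:i + 3]]
            y = winograd_f2x2_2x2(frag, w)
            for di in range(2):
                for dj in range(2):
                    out[i + di][j + dj] = y[di][dj]
    return out


if __name__ == "__main__":
    print(winograd_f2x2_2x2([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 2], [3, 4]]))
```

Neighbouring 3 × 3 fragments overlap by one row or column, which is why the step equals the output size 2 rather than the fragment size 3.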
The addition of several numbers modulo $2^{\alpha}$ and $2^{\alpha} - 1$ is performed using a multi-operand modulo adder (MOMA); we introduce for these devices the notation $\mathrm{MOMA}_{2^{\alpha}}$ and $\mathrm{MOMA}_{2^{\alpha}-1}$, respectively (Fig. 3). These devices consist of a carry-save adder (CSA) [20] and a Kogge-Stone adder (KSA) [21]. A vector of terms is the input of the devices, and their sum is the output. An end-around-carry (EAC) technique is used for modulo $2^{\alpha} - 1$ calculations [17].
The elements of the transformed data matrix $V = B^{T} X B$ modulo $2^{\alpha}$ are calculated using the device $\mathrm{MOMA}_{2^{\alpha}}$ (Fig. 3a). The representation of negative numbers modulo $2^{\alpha}$ requires converting them into two's complement code, that is, inverting the number and adding one. Therefore, to calculate the element $V_{i,j}$ modulo $2^{\alpha}$, where $0 \le i \le 2$ and $0 \le j \le 2$, the input of $\mathrm{MOMA}_{2^{\alpha}}$ is fed with the data vector and a correction constant equal to the number of negative terms. An exception is the element $V_{1,1} = X_{1,1}$, which does not require any calculations. Thus, the data transformation device modulo $2^{\alpha}$ consists of eight $\mathrm{MOMA}_{2^{\alpha}}$ devices, the inputs of which receive the data vectors listed in (22). The calculation of the matrix elements modulo $2^{\alpha} - 1$ requires the representation of negative numbers in one's complement code, which is the inversion of the number. Therefore, correction constants are not involved in the calculations. As seen from (22), the data transformation device modulo $2^{\alpha} - 1$ consists of four $\mathrm{MOMA}_{2^{\alpha}-1}$ devices, the inputs of which receive the vectors for the elements $V_{0,0}$, $V_{0,2}$, $V_{2,0}$, and $V_{2,2}$, and of four EAC-KSA devices, whose inputs are fed with the vectors for the elements $V_{0,1}$, $V_{1,0}$, $V_{1,2}$, and $V_{2,1}$.
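The role of the correction constant can be illustrated with the following Python sketch, which sums signed terms using only bit inversion and addition. The function names and the `(value, sign)` term format are hypothetical; the code models the arithmetic only, not the CSA and KSA structure of the adders.

```python
def sum_mod_pow2(terms, alpha):
    """Sum signed residues modulo 2**alpha.

    terms is a list of (value, sign) pairs, value in [0, 2**alpha), sign +1 or -1.
    A negative term is replaced by its bitwise inversion, and the missing
    "+1" of each two's complement is collected in one correction constant.
    """
    mask = (1 << alpha) - 1
    acc = sum(1 for _, sign in terms if sign < 0)  # correction constant
    for value, sign in terms:
        acc += value if sign > 0 else (~value & mask)
    return acc & mask


def sum_mod_pow2_minus1(terms, alpha):
    """Sum signed residues modulo 2**alpha - 1 with an end-around carry.

    Inversion alone is the one's complement, so no correction is needed.
    """
    mask = (1 << alpha) - 1
    acc = 0
    for value, sign in terms:
        acc += value if sign > 0 else (~value & mask)
        acc = (acc & mask) + (acc >> alpha)  # end-around carry
    return 0 if acc == mask else acc


if __name__ == "__main__":
    corner = [(5, +1), (9, -1), (12, -1), (3, +1)]  # 5 - 9 - 12 + 3 = -13
    print(sum_mod_pow2(corner, 4), sum_mod_pow2_minus1(corner, 4))
```

For the four-term example, the result is -13 reduced modulo 16 and modulo 15, respectively; a two-term difference modulo $2^{\alpha} - 1$ needs only a single end-around-carry addition, which is consistent with the use of EAC-KSA devices for those elements.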
The devices for elementwise multiplication of the matrices $U$ and $V$ modulo $2^{\alpha}$ and $2^{\alpha} - 1$ each consist of nine modular multipliers, shown in Fig. 4. The multiplier modulo $2^{\alpha}$ consists of a partial product generator modulo $2^{\alpha}$, which is formed from an array of AND gates [20], and $\mathrm{MOMA}_{2^{\alpha}}$. The multiplier modulo $2^{\alpha} - 1$ consists of a partial product generator modulo $2^{\alpha} - 1$ using the EAC technique, and $\mathrm{MOMA}_{2^{\alpha}-1}$. The elementwise product of $U$ and $V$ is formed at the outputs of these devices. Therefore, the F(2 × 2, 2 × 2) device for filtering by the Winograd method modulo $2^{\alpha}$ consists of the data transformation device, the elementwise matrix multiplication device, and the final transformation device (Fig. 5). The data fragment $|X|_{2^{\alpha}}$ is fed to the device input, and the fragment filtering result $|Y|_{2^{\alpha}}$ is formed at the output.
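The two kinds of partial product generation can be modelled in Python as shown below: modulo $2^{\alpha}$ the shifted copies of the multiplicand are truncated, whereas modulo $2^{\alpha} - 1$ they are cyclically rotated, because the bit shifted out re-enters at the least significant position. The function names are hypothetical, and the summation is written as a plain loop instead of a CSA tree.

```python
def mul_mod_pow2(a, b, alpha):
    """Multiply modulo 2**alpha from truncated, shifted partial products."""
    mask = (1 << alpha) - 1
    # Bit i of b gates a copy of a shifted by i; bits above alpha are dropped.
    partial = [(a << i) & mask for i in range(alpha) if (b >> i) & 1]
    return sum(partial) & mask


def rotl(x, i, alpha):
    """Cyclic left rotation of an alpha-bit word by i positions."""
    mask = (1 << alpha) - 1
    return ((x << i) | (x >> (alpha - i))) & mask


def mul_mod_pow2_minus1(a, b, alpha):
    """Multiply modulo 2**alpha - 1 from rotated partial products."""
    mask = (1 << alpha) - 1
    acc = 0
    for i in range(alpha):
        if (b >> i) & 1:
            acc += rotl(a, i, alpha)             # a * 2**i modulo 2**alpha - 1
            acc = (acc & mask) + (acc >> alpha)  # end-around carry
    return 0 if acc == mask else acc


if __name__ == "__main__":
    print(mul_mod_pow2(11, 13, 4), mul_mod_pow2_minus1(11, 13, 4))
```

In both cases the number of partial products equals the bit width $\alpha$, so the multi-operand adder that follows the generator sums at most $\alpha$ terms.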

A. THEORETICAL ANALYSIS
We use the abstract "unit-gate" model to estimate the delay and area parameters of digital devices [22]. In this model, the delay of a logic device is denoted $U_{delay}$ and its area is denoted $U_{area}$, and the parameters of the logic gates are given by (24). Then, according to (24), the parameters of $\mathrm{MOMA}_{2^{\alpha}}$ have the following form, where $k$ is the number of terms [14]:
$$U_{delay}(\mathrm{MOMA}_{2^{\alpha}}) = 6.8\log_2 k + 2\log_2 \alpha + 4, \qquad U_{area}(\mathrm{MOMA}_{2^{\alpha}}) = 3\alpha\log_2 \alpha + 7k\alpha - 11\alpha + 1.$$
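For a quick numerical feel, the sketch below evaluates the multi-operand adder estimates, assuming they read as a delay of 6.8 log2(k) + 2 log2(α) + 4 and an area of 3α log2(α) + 7kα − 11α + 1 unit gates, with k the number of terms and α the bit width. Both this reading of the formulas and the function names are assumptions.

```python
from math import log2


def moma_pow2_delay(k, alpha):
    """Assumed unit-gate delay of a k-term adder modulo 2**alpha."""
    return 6.8 * log2(k) + 2 * log2(alpha) + 4


def moma_pow2_area(k, alpha):
    """Assumed unit-gate area of a k-term adder modulo 2**alpha."""
    return 3 * alpha * log2(alpha) + 7 * k * alpha - 11 * alpha + 1


if __name__ == "__main__":
    for alpha in (4, 8, 16):
        print(alpha, round(moma_pow2_delay(4, alpha), 1), moma_pow2_area(4, alpha))
```

Under this reading, the delay grows only logarithmically with both the number of terms and the bit width, while the area grows almost linearly with their product.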
Taking into account the parameters (27), (33), and (36) of the data transformation, elementwise matrix multiplication, and final transformation devices, the F(2 × 2, 2 × 2) filter device based on the Winograd method with modulo $2^{\alpha}$ calculations has the delay and area parameters given by (37). The parameters of the proposed filtering device F(2 × 2, 2 × 2) with calculations in RNS with the moduli set $\{2^{\alpha_1}, 2^{\alpha_2} - 1, \ldots, 2^{\alpha_s} - 1\}$, shown in Fig. 6, are calculated taking into account (37) and (38). Based on the parameters of the F(2 × 2, 2 × 2) device presented in (37), the delay and area of the proposed Winograd-based filter were calculated for the various RNS moduli sets presented in Table I, which correspond to different bit widths of the input data.
The proposed filter architecture with computations in RNS was compared with the known filter architecture with computations in PNS [9]. The parameters of finite impulse response filters based on multiply-accumulate (MAC) blocks were also calculated; for a 2 × 2 filter mask, their delay and area parameters are calculated according to [23]. In addition, filters consisting of TMAC units with calculations in RNS with moduli of the special type were evaluated [14]. For modulo $2^{\alpha}$, the device parameters are calculated by formulas (41), and for modulo $2^{\alpha} - 1$, the filter delay and area parameters are calculated by separate formulas. A theoretical analysis of the delay and area parameters of the proposed device based on the "unit-gate" model is performed for various RNS moduli sets, together with a comparison with known filter architectures. Also, the processing time for a 256 × 256 two-dimensional signal fragment is calculated from the device delay. The results of the filter parameter calculations are presented in Table II. The theoretical analysis of the delay and area parameters of the proposed filter device and known analogs based on the "unit-gate" model allows conclusions to be drawn about the computational and space complexity of the reviewed methods. The complexity estimates are expressed in terms of the approximate bit width of the RNS dynamic range and the approximate bit width of each computational channel.
Since the output of the filter based on the Winograd method F(2 × 2, 2 × 2) consists of four processed signal values, for filters based on MAC [23] and TMAC [13,14] units we calculate the complexity of the method assuming sequential processing of four signal fragments.
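The effect of producing four values per pass on the processing time of a 256 × 256 fragment can be sketched as follows. The sketch assumes that the processing time is the device delay multiplied by the number of sequential passes; the delay values in the usage example are hypothetical placeholders, not values from Table II.

```python
def invocations(size, outputs_per_side):
    """Number of passes needed to produce size x size output values."""
    return (size // outputs_per_side) ** 2


def processing_time(delay, size=256, outputs_per_side=1):
    """Total time, assuming one pass takes `delay` and passes are sequential."""
    return delay * invocations(size, outputs_per_side)


if __name__ == "__main__":
    winograd_delay = 60.0  # hypothetical delay of a device producing 2 x 2 values
    mac_delay = 40.0       # hypothetical delay of a device producing 1 value
    t_winograd = processing_time(winograd_delay, outputs_per_side=2)
    t_mac = processing_time(mac_delay, outputs_per_side=1)
    print(t_mac / t_winograd)
```

Under this assumption, a device producing a 2 × 2 block per pass is faster overall whenever its delay is less than four times the delay of a device producing a single value.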
According to delay and area parameters (41), the complexity of the filtering method based on TMAC units [13] is then the computational complexity of the proposed method is less than the method based on TMAC units [13], that is ( (2 × 2,2 × 2) ) < ( ( ) 2 ). If ≥ 2 then the space complexity of the proposed method is less, that is ( (2 × 2,2 × 2) ) < ( ( ) 2 ). According to the "unit-gate" model parameters (43), the complexity of the TMAC-based method with calculations in RNS [14] is then the computational complexity of the proposed method is less than the method [14], that is ( (2 × 2,2 × 2) ) < ( ( ) ). If any integer > 1, then the space complexity of the proposed method is greater than the method [14], that is ( (2 × 2,2 × 2) ) > ( ( ) ). A description of the theoretical analysis results is presented in VI Section "Discussion".

B. HARDWARE IMPLEMENTATION
Hardware simulation of the proposed device architecture for filtering by the Winograd method F(2 × 2, 2 × 2) with calculations in RNS with moduli of the special type $2^{\alpha}$ and $2^{\alpha} - 1$ was performed. The proposed architecture was compared with known developments. The hardware simulation results are presented in Table III. The simulation was carried out in the Xilinx Vivado 2018.3 environment for the target board Artix-7 xc7a200tffg1156-3 with the optimization strategy Flow_PerfOptimized_high. The following parameters were used to evaluate the devices: clock frequency, number of occupied look-up tables (LUTs), power consumption, and performance, defined as the number of 256 × 256 two-dimensional signal fragments processed per second (fragments/s). A description of the hardware implementation results is presented in Section VI, "Discussion".

VI. DISCUSSION
Comparison of the computational and space complexity of the proposed and known methods showed that the advantage of the proposed method decreases as the dynamic range increases and as the number of RNS moduli decreases. Since, in practice, signal processing systems with small bit widths (for example, 8, 16, or 32 bits) are more often used, these limitations are insignificant. The theoretical analysis results (Table II) based on the "unit-gate" model showed that the use of RNS reduces the device delay by 24.79% – 66.77% and the device area by 17.59% – 53.67% compared with the known implementation based on Winograd filtering in PNS [9]. In addition, the proposed device architecture has 13.47% – 42.04% less delay and 2.20% – 18.03% less area than the known MAC-based filter architecture [23], except for the 8-bit device, which has a 47.38% larger area. Compared to the known device architecture based on TMAC units with computations in PNS [13], the delay of the proposed device is 20.92% – 22.22% less, but the area is 1.56% – 53.37% larger for 8- and 16-bit devices; for 32-bit devices, the delay is 12.17% larger, but the area is 18.03% less. Compared to the known architecture based on TMAC units with computations in RNS, the proposed 8-bit filter has 2.15% lower delay, but the 16- and 32-bit devices have 3.42% – 52.52% more delay. The area of the proposed device is approximately twice the area of a device based on TMAC units with computations in RNS [14]. The main advantage of the proposed filter architecture based on the Winograd method with calculations in RNS with moduli of the special type is the reduced processing time of a two-dimensional signal. Thus, the use of the proposed device makes it possible to reduce the processing time of a 256 × 256 signal by 1.33 – 6.90 times compared to other known architectures.
The results of hardware simulation (Table III) showed that the performance of the proposed filter architecture is 1.31 – 4.12 times higher in comparison with known architectures. The maximum clock frequency of the proposed device is 31.03% – 38.46% higher compared with the device based on the Winograd method [9] and 1.89% – 2.94% higher compared to the filter based on MAC units [23] for the case of 8- and 16-bit devices, but 26.21% less for the case of 8-bit devices. Nevertheless, compared with the TMAC-based filter architecture in PNS [13] and RNS [14], the performance of the proposed device is 7.89% – 35.59% lower. The number of LUTs occupied by the proposed device is 18.08% – 37.27% less than that of the filter based on the Winograd method [9], but 3.83 – 7.74 times more than that of the other reviewed known architectures. The power consumption of the proposed filtering device is 2.57% – 42.46% higher than that of the known devices. The insignificant difference between the theoretical analysis results and the hardware simulation results is due to the peculiarities of the "unit-gate" model, which does not account for the fan-out of devices.
The high performance of filters with Winograd method calculations is explained by the fact that several processed elements are produced at once (Fig. 2). However, the performance gains come at the expense of increased hardware costs, such as the number of occupied LUTs and power consumption. Thus, the proposed architecture can be successfully applied in digital signal processing systems in which performance is a crucial criterion. However, from the point of view of occupied area, an architecture based on TMAC units with calculations in RNS is preferable [14]. It is advisable to use the MAC-based filter [23] in systems with low power consumption.
The proposed filter architecture can be applied as part of a filter with a larger mask and decimation [9]. For example, a filter with kernel size 3 × 3 and decimation step two may be represented as two filters based on the Winograd method, F(2 × 2, 2 × 2) and F(2 × 2, 1 × 1). An interesting direction for future research is the application of the described approaches to Winograd-based filters with other filter mask dimensions $F(n \times n, r \times r)$ and their use in convolutional layers of convolutional neural networks.

VII. CONCLUSION
In this paper, we proposed a device architecture for two-dimensional filtering by the Winograd method F(2 × 2, 2 × 2) using calculations in RNS with moduli of the special type $2^{\alpha}$ and $2^{\alpha} - 1$. The theoretical analysis based on the "unit-gate" model shows that the speed of signal processing by the proposed device is 1.33 – 6.90 times higher than that of other known devices. The hardware implementation on FPGA shows that the performance of the proposed device is 1.31 – 4.12 times higher than that of other known methods. The research results may be applied to improve the technical characteristics of digital signal processing devices and in intelligent analysis systems for data preprocessing.