Classification of Moduli Sets for Residue Number System With Special Diagonal Functions

The paper presents algorithms for the generation of Residue Number System (RNS) triples with <inline-formula> <tex-math notation="LaTeX">$SQ=2^{k}-1$ </tex-math></inline-formula> and quadruples with <inline-formula> <tex-math notation="LaTeX">$SQ=2^{k}$ </tex-math></inline-formula> for some k. Such triples and quadruples allow efficient hardware implementations of non-modular operations in RNS (division, sign detection, comparison of numbers, reverse conversion) by means of a diagonal function, which requires division with remainder by the diagonal modulus SQ. Division with remainder is, in the general case, the most complex arithmetic operation in computer technology. However, the consideration of special cases can significantly simplify this operation and increase the efficiency of hardware implementation. We show that there are 5573 good RNS triples (2301 even and 2372 odd) with elements less than 10 000, with values of SQ ranging from <inline-formula> <tex-math notation="LaTeX">$2^{5}-1$ </tex-math></inline-formula> to <inline-formula> <tex-math notation="LaTeX">$2^{27}-1$ </tex-math></inline-formula>. In contrast, RNS quadruples with <inline-formula> <tex-math notation="LaTeX">$SQ=2^{k}$ </tex-math></inline-formula> seem to be quite rare. Restricting our search to quadruples whose elements sum to less than 4000, we find that exactly 31 such quadruples exist. Their values of SQ vary between 2<sup>20</sup> and 2<sup>30</sup>, always with an even exponent. We propose a measure of RNS balance and find perfectly balanced RNS among the triples according to this measure. We demonstrate the advantages of more balanced quadruples by means of a hardware implementation.


I. INTRODUCTION
The current level of computer technology requires the development of parallel computing architectures and of methods for organizing calculations on them. One way to organize calculations in parallel at the arithmetic level is to move from the traditional binary number system to the Residue Number System (RNS). The main idea of this replacement is that residues of small bit-width can be processed quickly and in parallel when performing addition, subtraction and multiplication. This approach is very promising for practical applications that require intensive execution of mainly these operations: digital signal and image processing [1], [2], cryptography [3] and machine learning [4]. The disadvantage of RNS is the high computational complexity of non-modular operations, which include division, sign detection and comparison of numbers [5], [6]. These limitations exist because RNS is a non-positional number system, so the magnitudes of numbers in RNS form cannot be compared directly; division in turn relies on magnitude comparison and is therefore also problematic. Finding faster algorithms would open up promising new application areas for RNS.
The associate editor coordinating the review of this manuscript and approving it for publication was Stavros Souravlas.
One way to implement non-modular operations in RNS is to use a diagonal function [7]-[10], which requires division with remainder by the diagonal modulus SQ. Division with remainder is, in the general case, the most complex arithmetic operation in computer technology. However, the consideration of special cases can significantly simplify this operation. If $SQ = 2^k$, the remainder of division by SQ is obtained by simply taking the $k$ least significant bits of the dividend, while the quotient is given by the remaining most significant bits. If $SQ = 2^k - 1$, the remainder of division by SQ is calculated as the sum of the $k$-bit parts of the dividend modulo SQ [11]. Algorithms for constructing RNS with special forms of SQ, based on number-theoretical properties, are presented in [7] and [8]. The main drawbacks of those algorithms are that an RNS with a predetermined dynamic range cannot be constructed and that, in most cases, the resulting RNS is unbalanced (the bit-widths of the modules differ too much).
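The two special cases above can be illustrated in software. The following is a minimal sketch of ours, not code from [11]; the function names `divmod_pow2` and `mod_mersenne` are hypothetical. It reduces modulo $2^k$ with a shift and a mask, and modulo $2^k - 1$ by repeatedly adding the $k$-bit parts of the dividend.

```python
def divmod_pow2(x, k):
    """Quotient and remainder of non-negative x by 2**k, using a shift and a mask."""
    return x >> k, x & ((1 << k) - 1)


def mod_mersenne(x, k):
    """Remainder of non-negative x modulo 2**k - 1 by folding k-bit parts.

    Since 2**k is congruent to 1 modulo 2**k - 1, adding the high part of x
    to its low k bits does not change the residue.
    """
    mask = (1 << k) - 1
    while x > mask:
        x = (x & mask) + (x >> k)
    # x is now in [0, mask]; the value mask itself represents residue 0.
    return 0 if x == mask else x
```

In hardware the loop corresponds to a small adder tree over the $k$-bit parts followed by one final correction, which is why these two forms of SQ are attractive.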
In this paper, we present algorithms for the generation of RNS triples with $SQ = 2^k - 1$ and quadruples with $SQ = 2^k$ for some predetermined $k$. By ''predetermined'' we mean that designers of RNS could use the results of our classification by choosing $k$ in advance in a certain range (for example, between 5 and 21 with a few exceptions for triples from Appendix A, and any even $k$ between 20 and 30 for quadruples from Appendix B). Moreover, the lists from Appendices A and B are complete for the ranges given in Theorems 4 and 6 below.
Our approach is based on careful handling of the exponent of 2 in the expressions that naturally arise when targeting the corresponding representations of SQ and related quantities. We show that there are 5573 good RNS triples (2301 with two odd and one even modules and 2372 with three odd modules) with elements less than 10 000, with values of SQ ranging from $2^5 - 1$ to $2^{27} - 1$. In contrast, RNS quadruples with $SQ = 2^k$ seem to be quite rare. We restrict our search to quadruples whose elements sum to less than 4000 and find that exactly 31 such quadruples exist. Their values of SQ vary between $2^{20}$ and $2^{30}$, always with an even exponent.
The rest of the paper is organized as follows. Section II presents the construction of RNS with a convenient form of the diagonal function. Sections III and IV present effective algorithms for computing triples and quadruples, respectively, to build effective RNS blocks based on the diagonal function. Section V discusses an approach to measuring RNS balance and methods of RNS construction using the proposed approach. Section VI presents hardware simulation results. Discussion is presented in Section VII. The conclusion of the paper is reported in Section VIII.

II. RNS MATH BACKGROUND. METHODOLOGY
Let $m_1, m_2, \ldots, m_n$, $n \geq 3$, be mutually co-prime positive integers and let
$$X \equiv x_i \pmod{m_i}, \quad i = 1, 2, \ldots, n, \tag{1}$$
be the corresponding system of congruences from the Chinese Remainder Theorem. In RNS, a solution $X$ of (1) is associated to the $n$-tuple $(x_1, x_2, \ldots, x_n)$ and that $n$-tuple is used in operations instead of $X$ [12, Section 3.4]. Denote
$$M = \prod_{i=1}^{n} m_i, \qquad M_i = \frac{M}{m_i}, \qquad SQ = \sum_{i=1}^{n} M_i.$$
Then $\gcd(SQ, m_i) = 1$ for all $i$. The diagonal modulus SQ is important in computations as it is instrumental in the definition of the diagonal function
$$D(X) = \left( \sum_{i=1}^{n} k_i x_i \right) \bmod SQ,$$
where the integers $k_i$ are defined by $k_i m_i \equiv -1 \pmod{SQ}$. It was observed that diagonal moduli with special binary representations (very low or very high Hamming weight) are useful in the construction of RNS with good performance. This point was described in [7].
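To make the definitions concrete, here is a small sketch that computes the diagonal modulus and the diagonal function for a given moduli set. It assumes the standard definitions $SQ = \sum_i M/m_i$ and $D(X) = \sum_i k_i x_i \bmod SQ$ from the diagonal-function literature [7]-[10]; the function names are ours, and the triple $(2, 3, 5)$ is chosen only because its diagonal modulus is $31 = 2^5 - 1$.

```python
from math import prod


def diagonal_modulus(moduli):
    """SQ: the sum of M / m_i over all moduli, where M is their product."""
    M = prod(moduli)
    return sum(M // m for m in moduli)


def diagonal_function(residues, moduli):
    """D(X) = (sum of k_i * x_i) mod SQ, with k_i * m_i = -1 (mod SQ)."""
    SQ = diagonal_modulus(moduli)
    # pow(m, -1, SQ) is the modular inverse (Python 3.8+); negating it gives k_i.
    ks = [(-pow(m, -1, SQ)) % SQ for m in moduli]
    return sum(k * x for k, x in zip(ks, residues)) % SQ


moduli = (2, 3, 5)                      # SQ = 15 + 10 + 6 = 31
X = 17
residues = [X % m for m in moduli]      # the RNS form of X
print(diagonal_modulus(moduli), diagonal_function(residues, moduli))
```

For $0 \leq X < M$ the value $D(X)$ coincides with $\sum_i \lfloor X/m_i \rfloor$, which gives a convenient brute-force check of such an implementation on small moduli sets.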
Our goal is to provide certain classification results for systems $(m_1, m_2, \ldots, m_n)$ for small $n$. Such a classification could be useful for designers of RNS in their search for good performance. Thus, we also discuss performance issues, adding here that the predetermined range of $M$ (see above for predetermined $k$) implied by our Appendices could be very useful.
We shall repeatedly use the following simple lemma. We use the notation $v_2(m)$ for the largest exponent of 2 that divides a positive integer $m$ (i.e., $m / 2^{v_2(m)}$ is an odd integer).
Lemma 1: If $a$ and $b$ are positive integers such that $a + b = 2^k$ for some positive integer $k$, then $v_2(a) = v_2(b)$.
Proof: Assume for a contradiction that $v_2(a) \neq v_2(b)$. Without loss of generality, let $v_2(a) > v_2(b)$. It is clear that then $v_2(a + b) = v_2(b) < k$, which contradicts $a + b = 2^k$.
Remark: It is clear from the proof that the conclusion $v_2(a) = v_2(b)$ of Lemma 1 is true whenever $\max\{v_2(a), v_2(b)\} < k$ (i.e., the condition that $a$ and $b$ both be positive integers can be dropped). We will use this fact once in our analysis of RNS quadruples.
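The 2-adic valuation is cheap to compute, and the property used throughout (as we read Lemma 1: positive $a$, $b$ with $a + b = 2^k$ have $v_2(a) = v_2(b)$) is easy to check numerically for small $k$. The sketch below is ours; `v2` and `lemma1_holds` are hypothetical helper names.

```python
def v2(m):
    """Exponent of the largest power of 2 dividing the non-zero integer m."""
    m = abs(m)
    # m & -m isolates the lowest set bit, i.e. the value 2**v2(m).
    return (m & -m).bit_length() - 1


def lemma1_holds(k):
    """Check v2(a) == v2(b) for every split a + b = 2**k with a, b >= 1."""
    return all(v2(a) == v2((1 << k) - a) for a in range(1, 1 << k))
```

An exhaustive check of this kind is no substitute for the proof, but it is a quick way to catch an off-by-one in an implementation of $v_2$.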
Our approach is based on applications of Lemma 1 with appropriate representations of SQ and its relatives. This leads to significant simplifications which allow us to develop efficient algorithms achieving our goals.

III. RNS TRIPLES WITH $SQ = 2^k - 1$
A. THREE ODD $m_i$
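We are interested in the diagonal modulus $SQ = 2^k - 1$ with three odd $m_1, m_2, m_3$. We call such triples odd.
Starting with co-prime odd $m_1$ and $m_2$, we require
$$SQ = m_1 m_2 + m_3 (m_1 + m_2) = 2^k - 1.$$
Since
$$m_3 (m_1 + m_2) + (m_1 m_2 + 1) = 2^k$$
and $m_3$ is odd, it follows from Lemma 1 that
$$v_2(m_1 + m_2) = v_2(m_1 m_2 + 1) = \omega \tag{2}$$
is a necessary condition. Moreover, we have
$$\gcd\left( \frac{m_1 + m_2}{2^{\omega}}, \frac{m_1 m_2 + 1}{2^{\omega}} \right) = 1. \tag{3}$$
Indeed, otherwise (2) implies that any nontrivial common prime divisor of $(m_1 + m_2)/2^{\omega}$ and $(m_1 m_2 + 1)/2^{\omega}$ would be an odd prime dividing $2^k$, a contradiction.
It follows from the above that
$$m_1 + m_2 = 2^{\omega} r, \qquad m_1 m_2 + 1 = 2^{\omega} s, \tag{4}$$
where $r$ and $s$ are co-prime odd integers. Then
$$SQ + 1 = 2^{\omega} (r m_3 + s),$$
and we have to seek solutions of
$$2^{l} \equiv s \pmod{r} \tag{5}$$
with respect to $l$. Each such solution defines a candidate for $m_3$ by
$$m_3 = \frac{2^{l} - s}{r}. \tag{6}$$
This $m_3$ is approved if it is co-prime to both $m_1$ and $m_2$.
Theorem 2: The so chosen $m_1, m_2, m_3$ form a good RNS triple with $SQ = 2^k - 1$, where $k = l + \omega$. Each odd triple can be found this way.
Proof: For $m_1, m_2, m_3$ as chosen above we have consecutively by (4) and (6)
$$SQ = m_1 m_2 + m_3 (m_1 + m_2) = 2^{\omega} s - 1 + 2^{\omega} r m_3 = 2^{\omega} (s + r m_3) - 1 = 2^{l + \omega} - 1.$$
Theoretical investigations can be separated into two cases depending on whether (5) always has solutions or not.
Case 1. $r = p^{\alpha}$, where $p$ is an odd prime and 2 is a primitive root modulo $p^{\alpha}$. In this case (5) always has solutions.
Case 2. $r$ is an odd integer such that (5) has solutions. This includes two sub-cases: (2.1) $r = p^{\alpha}$, where $p$ is an odd prime, but 2 is not a primitive root modulo $p^{\alpha}$, and (2.2) $r$ is divisible by at least two distinct primes. However, it is not necessarily important to distinguish between these cases. In particular, we may skip the check of whether 2 is a primitive root and just care about (5) having a solution.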
Therefore, the classification itself can be organized as follows.
Step 1. For a fixed even positive integer $A$, generate all pairs $(m_1, m_2)$ of odd positive integers such that $m_1 + m_2 = A$, $m_1 < m_2$.
Step 4. Check in increasing order the solutions of (5) until
This algorithm was implemented as a C++ program. The results are described at the end of the section.
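The search for odd triples can be sketched as follows. This is a Python illustration of ours, not the authors' C++ program. It assumes the decomposition $m_1 + m_2 = 2^{\omega} r$, $m_1 m_2 + 1 = 2^{\omega} s$ and the candidate $m_3 = (2^l - s)/r$, takes all moduli greater than 1, and orders each triple as $m_1 < m_2 < m_3$; the bound on $m_3$ and the function name `odd_triples` are our choices.

```python
from math import gcd


def v2(m):
    """Exponent of the largest power of 2 dividing the positive integer m."""
    return (m & -m).bit_length() - 1


def odd_triples(bound):
    """Odd pairwise coprime triples m1 < m2 < m3 <= bound with SQ = 2**k - 1.

    Returns a list of tuples (m1, m2, m3, k).
    """
    found = []
    for m2 in range(5, bound + 1, 2):
        for m1 in range(3, m2, 2):
            if gcd(m1, m2) != 1:
                continue
            w = v2(m1 + m2)
            if v2(m1 * m2 + 1) != w:         # necessary condition from Lemma 1
                continue
            r, s = (m1 + m2) >> w, (m1 * m2 + 1) >> w
            if gcd(r, s) != 1:
                continue
            # Try every exponent l for which m3 = (2**l - s) / r stays <= bound.
            l = 1
            while (1 << l) - s <= r * bound:
                num = (1 << l) - s
                if num > r * m2 and num % r == 0:
                    m3 = num // r            # odd, because 2**l - s and r are odd
                    if gcd(m3, m1) == 1 and gcd(m3, m2) == 1:
                        found.append((m1, m2, m3, l + w))
                l += 1
    return found
```

For instance, the pair $(5, 9)$ gives $\omega = 1$, $r = 7$, $s = 23$, and $l = 10$ yields $m_3 = 143$ with $SQ = 45 + 14 \cdot 143 = 2047 = 2^{11} - 1$. Trying every $l$ directly, rather than first solving the congruence for $l$ modulo the order of 2, keeps the sketch short at the cost of some wasted iterations.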

B. TWO ODD AND ONE EVEN $m_i$
It is clear that at most one of $m_1, m_2, m_3$ can be even. Thus, we complete the consideration of triples with diagonal modulus $SQ = 2^k - 1$ by analyzing triples $(m_1, m_2, m_3)$, where $m_1$ and $m_2$ are odd and $m_3$ is even. We call such triples even. Following the above scheme, we write
$$m_1 + m_2 = 2^{\omega_1} r, \qquad m_1 m_2 + 1 = 2^{\omega_2} s, \qquad m_3 = 2^{\rho} q,$$
where $q$, $r$, and $s$ are odd and mutually coprime. Now
$$2^k = m_3 (m_1 + m_2) + (m_1 m_2 + 1) = 2^{\omega_1 + \rho} q r + 2^{\omega_2} s,$$
and Lemma 1 gives $\omega_1 + \rho = \omega_2$. Note that $\omega_1 < \omega_2$. So far
$$2^k = 2^{\omega_2} (q r + s).$$
This means that we need to choose $q$ as a solution of $qr + s = 2^{l}$ for some $l$, thus getting
$$q = \frac{2^{l} - s}{r}.$$
Noting that $q = m_3 / 2^{\rho}$ shows the similarity with the case of three odd $m_i$ (which can be deduced from here by setting $\rho = 0$). Similarly to the above, the following statement holds.
Theorem 3: The so chosen $m_1, m_2, m_3$ form a good RNS triple with $SQ = 2^k - 1$, where $k = l + \omega_2$. Each even triple can be obtained this way.
In this case the classification can be organized as follows.
Step 1. For a fixed even positive integer $A$, generate all pairs $(m_1, m_2)$ of odd positive integers such that $m_1 + m_2 = A$, $m_1 < m_2$.
Step 4. Check in increasing order the solutions $q = (2^{l} - s)/r$ with the $l$'s from Step 3, finding
We implemented both algorithms for finding all suitable triples with $m_i \leq 1000$, $i = 1, 2, 3$. With a running time of less than a second, the program produced the 412 triples shown in Appendix A. These results were also confirmed by a brute-force search. Further, there are 5573 good triples (2301 even and 2372 odd) with $m_i \leq 10000$, generated by our algorithm in a few hours; the largest diagonal modulus that appears is $SQ = 2^{27} - 1$.
Theorem 4: There are exactly 412 triples $(m_1, m_2, m_3)$ (as given in Appendix A) such that $m_i \leq 1000$, $i = 1, 2, 3$, and $SQ = 2^k - 1$ for some positive integer $k$. Further, there are exactly 5573 good triples (2301 with two odd and one even $m_i$ and 2372 with three odd $m_i$) with $m_i \leq 10000$, $i = 1, 2, 3$, and $SQ = 2^k - 1$.
Proof: The necessary conditions from above mean that all possible triples are generated as described. The sufficiency follows from Theorems 2 and 3.
The list of all 5573 triples from Theorem 4 is available upon request.
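The case of two odd modules and one even module can be sketched in the same way. The code below is again an illustration of ours, not the authors' program. It assumes the decomposition $m_1 + m_2 = 2^{\omega_1} r$, $m_1 m_2 + 1 = 2^{\omega_2} s$, $m_3 = 2^{\rho} q$ with $\rho = \omega_2 - \omega_1 \geq 1$ and $q = (2^l - s)/r$, and it takes all moduli greater than 1; the name `even_triples` is hypothetical.

```python
from math import gcd


def v2(m):
    """Exponent of the largest power of 2 dividing the positive integer m."""
    return (m & -m).bit_length() - 1


def even_triples(bound):
    """Triples with m1 < m2 odd and m3 even, pairwise coprime, all <= bound.

    Returns a list of tuples (m1, m2, m3, k) with SQ = 2**k - 1.
    """
    found = []
    for m2 in range(5, bound + 1, 2):
        for m1 in range(3, m2, 2):
            if gcd(m1, m2) != 1:
                continue
            w1, w2 = v2(m1 + m2), v2(m1 * m2 + 1)
            if w1 >= w2:                      # rho = w2 - w1 must be at least 1
                continue
            rho = w2 - w1
            r, s = (m1 + m2) >> w1, (m1 * m2 + 1) >> w2
            # Need q = (2**l - s) / r with m3 = q * 2**rho <= bound.
            l = 1
            while (1 << l) - s <= r * (bound >> rho):
                num = (1 << l) - s
                if num > 0 and num % r == 0:
                    m3 = (num // r) << rho    # q is odd because num and r are odd
                    if gcd(m3, m1) == 1 and gcd(m3, m2) == 1:
                        found.append((m1, m2, m3, l + w2))
                l += 1
    return found
```

For the pair $(3, 5)$ we get $\omega_1 = 3$, $\omega_2 = 4$, $r = s = 1$, so $q = 2^l - 1$: $l = 1$ gives the triple $(3, 5, 2)$ with $SQ = 31 = 2^5 - 1$, and $l = 3$ gives $(3, 5, 14)$ with $SQ = 127 = 2^7 - 1$, while $l = 2$ is rejected because $m_3 = 6$ is not coprime to 3.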

IV. RNS QUADRUPLES WITH $SQ = 2^k$
For $n = 4$ the targeted SQ (odd or even) determines the type of the quadruple $(m_1, m_2, m_3, m_4)$. Since we need $SQ = 2^k$, all $m_i$ must be odd. Since
$$SQ = m_1 m_2 (m_3 + m_4) + m_3 m_4 (m_1 + m_2) = 2^k,$$
it follows from Lemma 1 that the first necessary condition is
$$m_1 + m_2 = 2^{\omega_1} r, \qquad m_3 + m_4 = 2^{\omega_1} s,$$
where $r$ and $s$ are odd and coprime. Now
$$2^k = 2^{\omega_1} (s m_1 m_2 + r m_3 m_4),$$
whence $2^{k - \omega_1} = r s 2^{\omega_1} (m_1 + m_3) - (r m_3^2 + s m_1^2)$. Using Lemma 1 again (see the remark), we require (10), where $q$ and $t$ are odd coprime positive integers (in fact, all four numbers $r, s, q, t$ are odd and mutually coprime).
Finally, we have to search for $rsq - t = 2^{\omega_3}$ for some positive integer $\omega_3$.
Thus, we propose the following classification procedure.
Step 3. Check whether $rsq - t$ is a power of 2.
We implemented the above algorithm for finding all suitable quadruples with $m_1 + m_2 + m_3 + m_4 \leq 4000$. The running time was a few hours on a home PC. The program produced the 31 quadruples shown in Appendix B. The results with $m_i \leq 1000$ were confirmed by a brute-force program (running for 9 hours on a GPU).
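Theorem 6: There are exactly 31 quadruples $(m_1, m_2, m_3, m_4)$ (as given in Appendix B) such that $m_1 + m_2 + m_3 + m_4 \leq 4000$ and $SQ = 2^k$ for some positive integer $k$.
Proof: The necessary conditions from above mean that all possible quadruples are generated as described. The sufficiency follows from Theorem 5.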
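A brute-force confirmation of the kind mentioned above is easy to sketch for small bounds. The code below is ours and is not the GPU program used in the paper; it simply enumerates odd, pairwise coprime quadruples with moduli greater than 1 and tests whether $SQ = \sum_i M/m_i$ is a power of 2. The function names are hypothetical.

```python
from itertools import combinations
from math import gcd, prod


def sq(moduli):
    """Diagonal modulus: the sum of M / m_i, where M is the product of the moduli."""
    M = prod(moduli)
    return sum(M // m for m in moduli)


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


def quadruples_brute_force(bound):
    """Pairwise coprime odd quadruples m1 < m2 < m3 < m4 <= bound with SQ = 2**k."""
    odd = range(3, bound + 1, 2)
    return [
        q for q in combinations(odd, 4)
        if is_power_of_two(sq(q))
        and all(gcd(a, b) == 1 for a, b in combinations(q, 2))
    ]
```

The cost grows with the fourth power of the bound, which is why a search of this kind is only practical as a cross-check and why the structured algorithm above is needed for larger ranges. With `bound = 100` the search already recovers the quadruple $(43, 51, 79, 91)$, whose diagonal modulus is $2^{20}$.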

V. BALANCE METRIC FOR BUILDING EFFECTIVE COMPUTATIONAL SYSTEMS
The practical implementation of the arithmetic operations of addition, subtraction and multiplication in an RNS with modules $(m_1, m_2, \ldots, m_n)$ is based on the parallel execution of the operations for each of the modules $m_i$, $i = 1, 2, \ldots, n$. In the general case, the addition of two numbers modulo $m_i$ has computational complexity $\approx O(b_i)$, where $b_i = \lceil \log_2 m_i \rceil$ is the bit-width of the modulus $m_i$. Multiplication of two numbers modulo $m_i$ generally has computational complexity $\approx O(b_i^2)$. If the RNS modules have very different bit-widths, the computational elements for modules of low bit-width stay idle for a long time while the computations for modules of higher bit-width complete [13]. This phenomenon is called an unbalanced RNS [14].
The triples and quadruples found in Sections III and IV and listed in Appendices A and B are not equally balanced. We introduce a measure of RNS balance. Let the RNS be defined by modules $(m_1, m_2, \ldots, m_n)$ with bit-widths $(b_1, b_2, \ldots, b_n)$. Denote the average bit-width of the RNS modules by
$$\bar{b} = \frac{1}{n} \sum_{i=1}^{n} b_i.$$
Obviously, a larger $\bar{b}$ implies a greater range $M$ of the RNS. Since, to our knowledge, no metric for RNS balance is available in the literature, we define the measure of RNS balance as the quantity
$$\beta = \frac{1}{n} \sum_{i=1}^{n} (b_i - \bar{b})^2,$$
i.e., the dispersion of the bit-widths of the RNS modules. Of two different RNSs, we consider the one with the smaller $\beta$ to be more balanced.
Definition 1: An RNS is called perfectly balanced if $\beta = 0$.
The cases examined show that the quadruples (5, 29, 93, 313) and (3, 7, 43, 2323) are least suitable for practical use since, firstly, they have the smallest ranges ($\log_2 M < 23$) and, secondly, they are very unbalanced, having the largest values of $\beta$. The quadruples (43, 51, 79, 91) and (23, 43, 87, 143) have large and approximately equal ranges ($\log_2 M > 23$); in practice, however, preference should be given to the quadruple (43, 51, 79, 91), since it is very well balanced, as confirmed by its smallest value of $\beta$.
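The balance measure is straightforward to compute. The sketch below rests on two assumptions of ours: the bit-width is taken as $b_i = \lceil \log_2 m_i \rceil$, and $\beta$ is the population variance of the bit-widths (division by $n$). Under this reading the quadruple (43, 51, 79, 91) has bit-widths 6, 6, 7, 7 and $\beta = 0.25$. The function names are hypothetical.

```python
def bit_width(m):
    """ceil(log2(m)) for an integer m >= 1, computed without floating point."""
    return (m - 1).bit_length()


def balance(moduli):
    """Return (mean bit-width, beta), where beta is the variance of the bit-widths."""
    widths = [bit_width(m) for m in moduli]
    mean = sum(widths) / len(widths)
    beta = sum((b - mean) ** 2 for b in widths) / len(widths)
    return mean, beta


for quad in [(43, 51, 79, 91), (23, 43, 87, 143), (5, 29, 93, 313), (3, 7, 43, 2323)]:
    print(quad, balance(quad))
```

Sorting a candidate list by `balance(moduli)[1]` then orders the moduli sets from most to least balanced.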
Fig. 1 shows the distribution of all 412 triples according to the value of $\beta$. There are 22 perfectly balanced triples.
Fig. 2 shows the distribution of the 31 quadruples according to the value of $\beta$. There are no perfectly balanced quadruples, but one quadruple has $\beta = 0.25$.
When developing a computing system in RNS, it is necessary to take into account the required range and the required number of modules. Once these parameters are determined, the most promising RNS moduli sets from Appendices A and B are those with the lowest possible $\beta$. These results can be used for building effective parallel computational systems [15] on devices with a parallel structure such as FPGAs and GPUs [16], [17]. The basic idea of a hardware implementation is that an algorithm (division, sign detection, comparison of numbers, reverse conversion) based on a diagonal function requires division by SQ. Since we were able to find quadruples for which $SQ = 2^k$, for such RNS the algorithm based on the diagonal function will be considerably more efficient than algorithms based on the Chinese remainder theorem (CRT) [18], CRT with fractional values (CRTf) [19] and mixed radix conversion (MRC) [20]. For example, division with remainder by $2^k$ costs essentially nothing, unlike division by $M$ in CRT, multiplication by $M$ in CRTf, or the various operations on the modules $m_1, m_2, \ldots, m_n$ in MRC. This is confirmed by our previous studies: [7] contains an example implementation of the non-modular operations of comparison and reverse conversion for triples and quadruples, which demonstrates the advantage of our proposals for FPGA implementations of RNS-based systems.

VI. HARDWARE MODELING
The goal of the hardware modeling is to compare circuits for a problematic device, the comparator, built with the known {5, 29, 93, 313} moduli set from [7] and with the proposed {43, 51, 79, 91} moduli set, which has the same $SQ = 2^{20}$ and the lowest $\beta$. To this end, the magnitude comparison of two numbers in RNS was implemented in FPGA. All simulated circuits were described in the very high-speed integrated circuit (VHSIC) hardware description language (VHDL). Hardware modeling was performed on a Xilinx Artix-7 xc7a200tfbg484-2 in Vivado 2018.3 with a highly area-optimized synthesis strategy. The modeling results were taken from an implementation run report. Fig. 3 shows a block diagram of the magnitude comparison operation using the diagonal function with a diagonal modulus of the form $2^k$. The moduli set $\{m_1, m_2, \ldots, m_n\}$ has bit-widths $a_1, a_2, \ldots, a_n$. Multiplication by constants is performed using the compression technique from [19]. First, a partial product generator (PPG) forms the partial products. Then, they are summed by a carry save adder (CSA) tree and a Kogge-Stone adder (KSA). The results of modeling are shown in Table 1.
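The behaviour of such a comparator can be modelled in software. The following sketch is a behavioural model of ours, not the VHDL design evaluated here. It assumes the usual diagonal-function comparison rule: compare $D(X)$ and $D(Y)$ first and, if they are equal, decide by a single pair of residues. It relies on the fact that for $0 \leq X < M$ the diagonal function equals $\sum_i \lfloor X/m_i \rfloor$ and is therefore non-decreasing in $X$. All names are hypothetical.

```python
from math import prod

MODULI = (43, 51, 79, 91)                  # diagonal modulus is 2**20 for this set
M = prod(MODULI)
SQ = sum(M // m for m in MODULI)
COEFFS = [(-pow(m, -1, SQ)) % SQ for m in MODULI]   # k_i with k_i * m_i = -1 (mod SQ)


def to_rns(x):
    """Residue representation of an integer 0 <= x < M."""
    return tuple(x % m for m in MODULI)


def diag(residues):
    """Diagonal function; since SQ is a power of 2, the reduction is a bit mask."""
    return sum(k * x for k, x in zip(COEFFS, residues)) & (SQ - 1)


def compare(xs, ys):
    """Return -1, 0 or 1 for X < Y, X == Y, X > Y, given their residue tuples."""
    dx, dy = diag(xs), diag(ys)
    if dx != dy:
        return -1 if dx < dy else 1
    # Equal diagonal values force equal quotients floor(X / m_1) and
    # floor(Y / m_1), so the first residues decide the order.
    return (xs[0] > ys[0]) - (xs[0] < ys[0])
```

The mask in `diag` is the software counterpart of dropping the upper bits of the adder output, which is the saving that a diagonal modulus of the form $2^k$ provides.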
These results confirm our conclusions based on the values of $\log_2 M$ and $\beta$. As for hardware resources, the authors of [20] showed that the selection of an appropriate moduli set should be based on the final usage of the RNS system. First of all, the required dynamic range must be covered; for the usage of hardware resources, the dynamic range is the most critical factor. The modeling results were normalized by dividing by the bit-width. The normalized results show that the {43, 51, 79, 91} moduli set reduces the delay of the device by 11.58%, but requires 23% more LUTs and 32.59% more power than the {5, 29, 93, 313} moduli set. This defines the difference in hardware usage for these moduli sets. In other words, a suitable trade-off between dynamic range and balance is needed, which is why it is not possible to single out one moduli set that is best for all applications.

VII. DISCUSSION
The methodology developed in Sections II-IV can be applied to investigations in other cases. It could be quite easily replicated to obtain classification results for triples with $SQ = 2^k + 1$. The application to quadruples with $SQ = 2^k \pm 1$ could be very similar to our treatment of the case of triples with two odd modules and one even module. Indeed, in this case we can start (taking $m_4$ to be even) with the representation
Other, more complicated, forms of SQ will require suitable generalizations (in fact, consideration of different cases) of Lemma 1.
The work [21] raises the question of finding the best parameters of the RNS in terms of the number of modules and their performance. The approaches proposed in this paper allow us to answer the question about the best RNS moduli sets in terms of the performance of algorithms based on the diagonal function. In our opinion, the best practical solution for performance is the choice of an RNS with four modules covering the given range $M$ and having the smallest $\beta$. Suitable cases for ranges from 21 to 36 bits can be found in Appendix B. Such ranges are usually sufficient to solve most practical problems in digital processing of signals and images. Another important issue in RNS theory is the effective implementation of the reverse conversion. As shown in [7], even unbalanced triples and quadruples can show good results for reverse converters based on a diagonal function. Therefore, the triples and quadruples found in this paper can further improve the results for reverse converters, owing to the greater balance of the RNS modules.
Further research will be related to testing algorithms that use SQ on FPGA and GPU. These algorithms will be used to develop faster methods of digital signal processing, cryptography and machine learning using the proposed quadruples for RNS with SQ. It would also be interesting to study the relationship between $\beta$ and the losses due to equipment downtime in an unbalanced RNS. This relationship can be determined, but doing so requires a very large number of hardware implementations and is a topic for a separate study. Even then, in order to classify by $\beta$, acceptable levels of losses would first have to be determined for each system. In other words, it would be interesting to determine, theoretically or experimentally, the threshold for $\beta$ below which the RNS could be considered well balanced and above which it could be considered poorly balanced. Another interesting area of further research is the search for RNS with a larger number of modules (6, 8, etc.) and with $SQ = 2^k$. For example, it can be shown that there exists a unique such 6-tuple with elements less than 500. The main problem here is the increasing computational complexity of the search algorithm.

VIII. CONCLUSION
We presented heuristic algorithms for the generation of RNS triples with $SQ = 2^k - 1$ and quadruples with $SQ = 2^k$ for some $k$. Such classification results could be useful for designers of RNS in their search for good performance. Thus, we also discussed performance issues. Our approach is based on careful handling of the exponent of 2 in the expressions that naturally arise when targeting the corresponding form of SQ. A measure of RNS balance was proposed. Also, perfectly balanced RNS were defined and found among the triples.

APPENDIX A
See Table 2.

APPENDIX B
See Table 3.
Remark: $k$ is even, between 20 and 30.