Interactive Proofs for Rounding Arithmetic

Interactive proofs are a type of verifiable computing that secures the integrity of computations. The need is increasing as more computations are outsourced to untrusted parties, e.g., cloud computing platforms. Existing techniques, however, have mainly focused on exact computations, but not approximate arithmetic (e.g., floating-point or fixed-point arithmetic). This makes it hard to apply them to certain types of computations (e.g., machine learning or financial applications) that inherently require approximate arithmetic. In this paper, we present an efficient interactive proof system for arithmetic circuits with rounding gates that can represent approximate arithmetic. The main idea is to reduce the rounding gate into a small sub-circuit without rounding, and reuse the machinery of the Goldwasser, Kalai, and Rothblum’s protocol (also known as the GKR protocol) and its recent refinements. Specifically, we shift the algebraic structure from a field to a ring to better deal with the notion of “digits”, and generalize the original GKR protocol over a ring. Then, we reduce the rounding operation to a low-degree polynomial over a ring, and develop a novel, optimal circuit construction of an arbitrary polynomial to transform the rounding polynomial to an optimal circuit representation. Moreover, further optimization on the proof generation cost for rounding is presented employing a Galois ring. Our experimental results show the efficiency of our protocol for approximate arithmetic, e.g., the implementation performed two orders of magnitude better than the existing system for a nested 128 by 128 matrix multiplication of depth 12 on the 16-bit fixed-point arithmetic.


I. INTRODUCTION
Verifiable computing (VC) [2], [3], [4], [5] aims to ensure the integrity of computations performed by an untrusted party. In the cloud computing era, as more computationally heavy tasks are delegated to the cloud, VC is considered as a compelling solution for proving their security. The existing literature has demonstrated the feasibility of several basic primitives, such as addition, multiplication, comparisons [6], set operations [7], and key-value store retrieval [8]. Using these primitives, VC was shown feasible for a number of tasks, including matrix multiplication [9], [10], [11], certain SQL-like queries [12], and state-machine updates [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Huang.
The existing literature, however, has mainly focused on exact computations. For example, existing systems can verify 1.11 × 2.22 = 2.4642, but not 1.11 × 2.22 ≈ 2.46, although $\lfloor 1.11 \times 2.22 \rceil_2 = 2.46$ (where $\lfloor \cdot \rceil_2$ denotes rounding to two decimal places). Because they do not support approximate arithmetic (e.g., fixed-point or floating-point arithmetic), the existing techniques are hard to apply to certain types of computations (e.g., machine learning) that require approximate arithmetic.
In particular, one approach to dealing with approximate arithmetic in the existing VC systems over a finite field $\mathbb{F}_p$ is the integer scaling method (footnote i). Specifically, in the integer scaling method, fractional inputs are multiplied by some scaling factor so that they can be regarded as elements of $\mathbb{F}_p$ (e.g., from 1.23 to 123), and the prime p is chosen to be larger than all intermediate values arising during the computation. Then, plain multiplication over $\mathbb{F}_p$ can be used for approximate multiplication, where the computation results are interpreted as fractional numbers by dividing them by their accumulated scaling factor. Note, however, that in this approach, the bitsize of intermediate values grows exponentially in the depth d of an arithmetic circuit, and thus the bitsize of p must be exponential in d. Therefore, the integer scaling method incurs a cost blowup on the order of $2^d$, which is infeasible in practice.
i We also compare other approaches using Boolean circuit representation or commitment schemes in Section VII-E.
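The growth described for integer scaling can be seen in a small experiment. The following Python sketch is illustrative only: the function name `scaled_bits_after_depth` and the choice of repeated squaring as the worst case are assumptions made here, not part of the paper.

```python
def scaled_bits_after_depth(eta, depth):
    """Bit length of the largest scaled value after each multiplication layer.

    Inputs are eta-bit integers (fractions scaled by 2**eta). Without rounding,
    each layer multiplies two values of the previous layer, so the bit length
    roughly doubles per layer.
    """
    value = (1 << eta) - 1          # largest eta-bit scaled input
    sizes = []
    for _ in range(depth):
        value = value * value       # exact product, no rounding
        sizes.append(value.bit_length())
    return sizes

# 16 fractional bits, 5 multiplication layers
bits = scaled_bits_after_depth(eta=16, depth=5)
print(bits)
```

After k layers the values need about 16·2^k bits, so a field large enough for depth 12 would need a modulus of tens of thousands of bits.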

A. OUR MAIN RESULT
1) PROBLEM FORMULATION: VERIFIABLE COMPUTING FOR FIXED-POINT ARITHMETIC
Suppose we are given an arithmetic circuit on fixed-point arithmetic with η fractional bits. For simplicity of description, we assume that all inputs and intermediate values during computations are contained in [0, 1), i.e., unsigned fixed-point numbers with no integer bits. Then, the fixed-point arithmetic can be translated to arithmetic over $\mathbb{Z}_{2^{2\eta}} := \mathbb{Z}/2^{2\eta}\mathbb{Z}$ (i.e., the ring of integers modulo $2^{2\eta}$) with an additional rounding operation, as follows: (i) Every input in [0, 1) is multiplied by $2^{\eta}$ to be regarded as an element of $\mathbb{Z}_{2^{2\eta}}$; (ii) Fixed-point addition translates to the usual addition over $\mathbb{Z}_{2^{2\eta}}$, and fixed-point multiplication translates to the usual multiplication over $\mathbb{Z}_{2^{2\eta}}$ followed by rounding; (iii) The rounding operation follows each multiplication to extract the η most significant bits, i.e., $x \mapsto \lfloor x/2^{\eta} \rfloor$ (footnote ii). Now, the problem of verifiable computing for fixed-point arithmetic can be reduced to verifiable computing for arithmetic circuits over $\mathbb{Z}_{2^{2\eta}}$ with rounding gates.
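The translation in steps (i) to (iii) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the value `ETA = 8` and the helper names `encode`, `fx_add`, `fx_mul`, and `decode` are chosen here for illustration and do not come from the paper.

```python
ETA = 8                       # number of fractional bits (illustrative choice)
MOD = 1 << (2 * ETA)          # modulus of the ring Z_{2^{2*eta}}

def encode(x):
    """Step (i): map x in [0, 1) to the ring element floor(x * 2^eta)."""
    return int(x * (1 << ETA))

def fx_add(a, b):
    """Step (ii): fixed-point addition is plain ring addition."""
    return (a + b) % MOD

def fx_mul(a, b):
    """Steps (ii)+(iii): ring multiplication, then x -> floor(x / 2^eta)."""
    return ((a * b) % MOD) >> ETA

def decode(a):
    """Interpret a ring element as a fraction with eta fractional bits."""
    return a / (1 << ETA)

product = decode(fx_mul(encode(0.5), encode(0.25)))
print(product)
```

Because both operands are below $2^{\eta}$, the product is below $2^{2\eta}$ and the rounded result again fits in η bits, so values do not grow with circuit depth.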

2) OUR MAIN RESULT
We present an interactive proof protocol for an arithmetic circuit with the rounding gates, whose complexity is described in the following theorem.
Theorem 1: Let C be a layered arithmetic circuit over $\mathbb{Z}_{2^{2\eta}}$ where the multiplication gate performs regular multiplication followed by rounding, i.e., $(x, y) \mapsto \lfloor xy/2^{\eta} \rfloor$. (Thus, C corresponds to a fixed-point arithmetic circuit with η fractional bits only.) Let us fix η. Let S be the size of C, d be the depth of C, and n be the number of inputs. Then, there is an interactive proof protocol (P, V) for the function C: […] We do not take into account the η factor in the asymptotic costs, since η is fixed to a small constant.
ii Here, the most significant bits are extracted considering the output of multiplication as a 2η-bit element. Indeed, proper rounding is $x \mapsto \lfloor x/2^{\eta} \rceil$, which is easily expressed with $\lfloor \cdot \rfloor$ as $x \mapsto \lfloor (x + 2^{\eta-1})/2^{\eta} \rfloor$, and we use $\lfloor \cdot \rfloor$ for simplicity of description in the following sections.
iii We do not take into account V's cost for computing the wiring predicates [9], or we assume that the circuit is highly regular [14].
Our protocol is based on Goldwasser, Kalai, and Rothblum's interactive proof system (GKR protocol) [4] and its recent refinements [9], [15], [16]. The costs of the latest GKR protocol variant [16] (which does not support fixed-point arithmetic) are O(S), O(n + d log S) (footnote iii), and O(d log S) for P, V, and communication, respectively. Thus, the additional cost to support fixed-point arithmetic in our protocol is roughly quadratic in the depth of the circuit.
In Table 1, we compare the asymptotic complexity of our protocol with the integer scaling method applied on top of the latest GKR variant. The integer scaling method incurs exponential overhead in circuit depth to deal with fixed-point arithmetic, as mentioned earlier.

3) EXPERIMENTAL RESULT
We conducted experiments to quantify the performance of our protocol. On a moderate laptop, for $2^{12}$ 16-bit rounding operations, proof generation took a second, while proof verification took less than a millisecond. We also show experimentally that our protocol is much more efficient than the integer scaling method. Given a nested 128 × 128 matrix multiplication of depth 12 over fixed-point numbers with 16 fractional bits, our refinement took 3 minutes to generate a proof for each matrix multiplication, while the integer scaling method took 2.5 hours for the same task (Section VII). The gap between the two will increase exponentially as the depth of multiplication increases (e.g., the depth often increases to hundreds or thousands in neural network training).

1) OUR GOAL AND DESIGN RATIONALE
The goal is to construct an interactive proof system for fixed-point arithmetic circuits, i.e., arithmetic circuits with rounding gates. The idea is to represent each rounding gate by a small sub-circuit without rounding, and reuse the machinery of the GKR protocol on it. Specifically, we consider an arithmetic circuit over a ring $\mathbb{Z}_{p^e} := \mathbb{Z}/p^e\mathbb{Z}$ (i.e., integers in a base-p system of e digits) with the (floor) rounding operation $x \mapsto \lfloor x/p \rfloor$, where p is a prime and e is a positive integer. (Note that proper rounding, $\lfloor x/p \rceil$, can be represented using floor rounding, i.e., $\lfloor x/p \rceil = \lfloor (x + \frac{p-1}{2})/p \rfloor$.) The main reason for introducing this ring (instead of the usual finite field) is that the rounding operation can be represented much more efficiently than in a finite field. Below we explain each of our main technical developments.
VOLUME 10, 2022

2) REDUCING A ROUNDING OPERATION TO A COMBINATION OF RING OPERATIONS AND DIVISION OPERATIONS (SECTION IV)
We present a sub-circuit representation of a rounding gate over the base-p system, i.e., over $\mathbb{Z}_{p^e}$. First, we employ the lowest-digit-removal polynomial [17], denoted by ldr, which sets the least significant digit to zero, i.e., $\mathrm{ldr} : x \mapsto \lfloor x/p \rfloor \cdot p$. Thus, we can represent the floor rounding operation by ldr(x)/p. The key observation is that the degree of ldr is small (less than ep), and that the division-by-p operations are naturally verifiable in the paradigm of the GKR protocol. We also provide an optimal arithmetic circuit for evaluating ldr and proving the result efficiently, as we explain below.
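The input/output behaviour of ldr and of the derived floor rounding can be illustrated as follows. This sketch computes ldr directly with integer arithmetic rather than through the low-degree polynomial of [17], whose coefficients are not given in this section; the function names are hypothetical.

```python
def ldr(x, p, e):
    """Lowest-digit removal on Z_{p^e}: zero out the least significant base-p digit."""
    x %= p ** e
    return x - (x % p)            # equals floor(x / p) * p

def floor_round(x, p, e):
    """Floor rounding x -> floor(x / p), obtained as ldr(x) / p (an exact division)."""
    return ldr(x, p, e) // p

# Example in Z_{3^3}: 17 is written 122 in base 3; removing the lowest digit gives 120 = 15
print(ldr(17, 3, 3), floor_round(17, 3, 3))
```

Since ldr(x) is always a multiple of p, the final division is exact, which is why rounding reduces to one polynomial evaluation plus one division by p.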

3) OPTIMAL CIRCUIT CONSTRUCTION FOR ARBITRARY UNIVARIATE POLYNOMIAL (SECTION V)
In the GKR protocol (as well as ours), a computation of interest needs to be represented as an arithmetic circuit, and the performance of the protocol is substantially affected by the structure of the circuit. To this end, we devise a novel, optimal circuit construction of an arbitrary polynomial for the GKR protocol. The constructed circuit is regular with depth O(log d) and size O(d), where d is the degree of the polynomial, and it is optimal in that the proof generation complexity is also O(d). Moreover, when the same polynomial is evaluated on m parallel inputs, our circuit construction yields a circuit of size O(md) and depth O(log d), for which the proof generation cost is $O(m\sqrt{d} + d)$, which is sublinear in the circuit size (i.e., the proof generation is even faster than the circuit evaluation!), while the previously best known result is linear [16]. This improvement of the proof generation cost is critical, since such a single-polynomial-multiple-inputs computation is common in data-parallel computing as well as neural network training (e.g., the activation function of each layer is applied pointwise to a weight vector/matrix).
To achieve this, we carefully redesign the Paterson-Stockmeyer algorithm [18] (for polynomial evaluation) into a regular circuit, and handle the (costly) constant multiplication part with one linear-sum gate, $(x_1, \ldots, x_n) \mapsto a_1 x_1 + \cdots + a_n x_n$, which can be efficiently proved and verified via the GKR protocol.
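For reference, the classical Paterson-Stockmeyer evaluation strategy can be sketched as below. This is the textbook baby-step/giant-step algorithm over $\mathbb{Z}_N$, not the regular-circuit construction of Section V; the function name and the block-size choice are assumptions made for illustration. Each inner block is a linear sum of precomputed powers with constant coefficients, which mirrors the role of the linear-sum gate described above.

```python
import math

def paterson_stockmeyer(coeffs, x, modulus):
    """Evaluate sum(coeffs[i] * x**i) mod modulus using roughly 2*sqrt(d)
    multiplications between non-constant values."""
    if not coeffs:
        return 0
    n = len(coeffs)
    k = math.isqrt(n - 1) + 1                 # block size, about sqrt(degree)
    # Baby steps: x^0, x^1, ..., x^k
    powers = [1 % modulus]
    for _ in range(k):
        powers.append(powers[-1] * x % modulus)
    giant = powers[k]                          # x^k, used for the giant steps
    # Split coefficients into blocks of length k
    blocks = [coeffs[i:i + k] for i in range(0, n, k)]
    result = 0
    for block in reversed(blocks):
        # Linear sum of baby-step powers with constant coefficients
        inner = sum(c * powers[j] for j, c in enumerate(block)) % modulus
        # Horner step in the variable x^k
        result = (result * giant + inner) % modulus
    return result

coeffs = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]        # degree-9 example polynomial
value = paterson_stockmeyer(coeffs, 7, 256)
```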

4) GENERALIZATION OF THE GKR PROTOCOL OVER A RING (SECTION III)
The original GKR protocol is valid over a finite field, but the domain $\mathbb{Z}_{p^e}$ we consider is no longer a field for e > 1. We therefore identify a minimal modification to the original protocol to admit a ring (Section III-B), and present its construction for a specific family of rings, i.e., $\mathbb{Z}_{p^e}$ and its extension rings.
Specifically, the GKR protocol is based on the Sum-Check protocol that in turn is based on the Schwartz-Zippel lemma. However, the Schwartz-Zippel lemma does not hold for a ring in general. To extend the original protocol, we first employ the generalized Schwartz-Zippel lemma [19] over a ring, which restricts the (randomness) sampling set to a subset of the domain such that the difference between any two elements of the subset is not a zero divisor. Then, we show that the Sum-Check protocol as well as the GKR protocol can be extended over a ring by restricting the (verifier's randomness) sampling set to a subset satisfying the aforementioned property. Moreover, we further identify a stronger condition for the sampling set (Remark 1), the ''unit difference'' property [20], that is, that the difference between any two elements of the sampling set has an inverse. This stronger condition allows us to employ all cost reduction techniques [9], [14], [16] proposed for the original GKR protocol to our extended protocol.
Hence, the extended protocol enjoys the same computational complexity for each participant as the original one, provided that the unit difference property holds for the sampling set A. However, the soundness probability becomes bigger (i.e., worse) than that of the original. It is proportional to 1/|A|, where the denominator is the size of the sampling set A, whereas in the original protocol it is the size |F| of the entire domain field. Note, however, that for practical purposes, the soundness probability can be quickly improved by simply running multiple prover-verifier pairs in parallel, which does not affect the overall throughput.
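Both points of this subsection can be checked on a toy ring. The sketch below is illustrative: the ring $\mathbb{Z}_9$, the polynomial 3x, and the candidate sampling set {0, 1, 2} are choices made here, not parameters taken from the paper. It shows a degree-1 polynomial with more than one root over the whole ring, and a subset whose pairwise differences are all units.

```python
from math import gcd
from itertools import combinations

def roots(poly, modulus, domain):
    """Roots of sum(poly[i] * x**i) in Z_modulus, restricted to `domain`."""
    return [x for x in domain
            if sum(c * x**i for i, c in enumerate(poly)) % modulus == 0]

def has_unit_differences(sample_set, modulus):
    """True if a - b is invertible in Z_modulus for all distinct a, b in the set."""
    return all(gcd(a - b, modulus) == 1 for a, b in combinations(sample_set, 2))

p, e = 3, 2
N = p ** e                               # the ring Z_9
poly = [0, 3]                            # the degree-1 polynomial 3x
all_roots = roots(poly, N, range(N))     # three roots, although the degree is 1
A = list(range(p))                       # candidate sampling set {0, 1, 2}
roots_in_A = roots(poly, N, A)           # at most one root inside A
print(all_roots, roots_in_A, has_unit_differences(A, N))
```

Restricting the verifier's randomness to A restores the "at most degree-many roots" behaviour, at the price of a sampling set of size |A| = p instead of the whole ring.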

5) OPTIMIZATION OF PROOF GENERATION COST FOR ROUNDING OPERATION (SECTION VI)
Consider an approximate computation on $\mathbb{Z}_{p^e}$. The underlying ring $\mathbb{Z}_{p^e}$ can be replaced by another ring $\mathbb{Z}_{q^{de}}$ with a much smaller prime $q \approx \sqrt[d]{p}$, via base conversion, that is, converting numbers in the base-p system to the corresponding numbers in the base-q system (footnote iv). Here the advantage of employing $\mathbb{Z}_{q^{de}}$ is that the degree of the rounding polynomial in $\mathbb{Z}_{q^{de}}$ is much smaller than that in $\mathbb{Z}_{p^e}$, which in turn significantly reduces the proof generation cost for rounding. However, employing $\mathbb{Z}_{q^{de}}$ directly sacrifices the soundness of the protocol. To resolve this dilemma, we propose a technique that allows us to employ $\mathbb{Z}_{q^{de}}$ without compromising the soundness, by exploiting an interesting property of a Galois ring.
Specifically, we employ a Galois ring $\mathbb{Z}_{q^{de}}[t]/(f(t))$, where f(t) is a monic irreducible polynomial, in the proof generation and verification phases, while we keep using $\mathbb{Z}_{q^{de}}$ in the circuit evaluation phase. This allows us to employ a smaller prime $q \sim \sqrt[d]{p}$, where d is the degree of f(t).
iv The converted number may be marginally different from the original, but such an inaccuracy is acceptable in approximate computation such as DNN training.
Employing a smaller prime further reduces the size of the rounding circuit, since the degree of the lowest-digit-removal polynomial drastically decreases from $ep$ to $edq \approx ed\sqrt[d]{p}$. Note that the soundness probability is not compromised at all with the smaller prime q, because the extension ring yields a sampling set of similar size, $q^d \approx p$, to that of the original one (Theorem 5).
However, there is a cost overhead when employing a Galois ring, since the operations on a Galois ring become more expensive as its degree d increases. Thus, choosing too small a prime q may offset the aforementioned cost benefit. Nevertheless, one can find an optimal q for a given set of parameters, and our experiments showed that a performance improvement of two orders of magnitude can be achieved by finding such a sweet spot (Section VII).
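The trade-off can be made concrete with simple arithmetic on the degree bound quoted above (number of digits times base). The parameter values below (p = 65537, e = 2, q = 17, d = 4) are hypothetical, chosen only so that $q^d$ is close to p; they are not the parameters used in the paper's experiments.

```python
def ldr_degree_bound(base, digits):
    """Degree bound (number of digits) * (base) quoted in the text for ldr."""
    return digits * base

p, e = 65537, 2           # original ring Z_{p^e} (hypothetical parameters)
q, d = 17, 4              # smaller prime with q^d close to p

bound_original = ldr_degree_bound(p, e)        # e * p
bound_converted = ldr_degree_bound(q, d * e)   # (d * e) * q
sampling_set_size = q ** d                     # size available in the degree-d extension

print(bound_original, bound_converted, sampling_set_size)
```

In this toy setting the degree bound drops by about three orders of magnitude while the sampling set keeps roughly the same size, which is the effect the Galois ring optimization is meant to exploit; the extra cost of degree-d ring operations is not modeled here.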

C. RELATED WORK
The problem of delegating computation while securing its integrity has been extensively studied from both theoretical and practical perspectives. Here we mainly focus on the general-purpose protocols and systems that aim to be practical.

1) INTERACTIVE PROOFS: GKR PROTOCOL AND ITS REFINEMENTS
Goldwasser, Kalai, and Rothblum [4] proposed an interactive proof protocol (also known as GKR protocol) that runs in polynomial time. For a layered arithmetic circuit of size S and depth d, the prover of their protocol runs in time poly(S), and the verifier runs in time poly(d, log S).
Several refinements of the GKR protocol have been proposed to improve the cost of the protocol, especially the prover's cost. Cormode, Mitzenmacher, and Thaler [9] presented a refinement of the GKR protocol (hereafter, CMT) that allows the prover to run in O(S log S). Thaler [14] further improved the protocol, allowing the prover to run in O(S) for a circuit with a ''highly'' regular wiring pattern. Subsequently, it has been shown that the prover's cost can be reduced when a circuit is composed of many parallel copies of subcircuits. Specifically, the prover's cost is reduced to $O(S \log S_c)$ in [12] and [15], and further reduced to $O(S + S_c \log S_c)$ in [21], where $S_c$ is the size of a subcircuit. Recently, Xie et al. [16] proposed a refinement that allows the prover to run in O(S) for an arbitrary circuit. Although the two are asymptotically equivalent, Thaler's refinement [14] performs better than that of Xie et al. [16] for a ''highly'' regular circuit.
On the other hand, substantial efforts have been made to support more operations than plain field arithmetic. Vu et al. [6] proposed an extension of CMT that supports inequalities by augmenting a circuit with additional verification logic and auxiliary inputs to be fed by the prover. However, their approach suffers from a significant verifier overhead due to the irregularity of their augmented circuit, which needs to be amortized by batching verifications (i.e., verifying the same circuit against many different inputs at the same time) for practical purposes. Zhang et al. [12] improved this by combining CMT with a verifiable polynomial delegation scheme, and showed that an arithmetic circuit with auxiliary inputs can be verified efficiently without batching strategies.
Note, however, that to the best of our knowledge, no existing interactive proof system efficiently supports a verifiable rounding operation, which is critical for dealing with an approximate arithmetic circuit of large depth (footnote v). Though the approach of Vu et al. [6] with auxiliary inputs can support the rounding operation in principle, it forces the verifier's cost to be linear in the number of rounding operations, since the verifier must check the auxiliary input, whose size is at least the number of rounding operations multiplied by the bitsize of the rounding input. Therefore, it does not provide efficient verification of rounding operations. The approaches of Zhang et al. [12], [28] and Wahby et al. [26] resolve this problem using a commitment scheme. However, due to the use of commitments, their system becomes an argument that is secure only against computationally bounded dishonest provers, and it does not provide post-quantum security because of the specific polynomial commitment scheme based on pairing groups. Also, since the computational cost of commitment is higher than that of the plain operations of interactive proofs, the performance of our system for the rounding operation can be better than or comparable to theirs when the bitsize of the rounding operation is not large (see Discussion VII-E).

2) ARGUMENTS: NON-INTERACTIVENESS AND ZERO-KNOWLEDGE
Argument systems are different from interactive proofs in that they are secure only against computationally bounded dishonest provers. Employing cryptographic primitives, they can provide versatile properties such as non-interactiveness, public verifiability, and zero-knowledge proofs. However, the use of expensive cryptographic primitives incurs a significant overhead to the prover's cost.
On the other hand, there has been much effort on developing argument systems without using short PCPs. Setty et al. [11], [34], [35] proposed argument systems based on linear PCPs [36], and their systems were shown to achieve practical performance in the batch verification setting. Gennaro et al. [37] introduced quadratic arithmetic programs (QAPs), a novel efficient encoding of computations for verifiable computing and even for zero-knowledge succinct non-interactive arguments (zkSNARKs). Many improvements have been proposed [10], [38], [39], [40] to make the proving time practical, and several works have explored various techniques with different trade-offs in prover time, verifier time, and supported functionalities (universality, no trusted setup, etc.) [41], [42], [43], [44], [45], [46], [47], [48], [49], [50]. Still, such argument systems asymptotically have either a quasi-linear prover cost or an inefficient verifier. All of these works, however, are applicable only to computations over a finite field, and are not efficient for representing fixed-point arithmetic with rounding operations.
v Although, in theory, the existing work can support rounding by degenerating to much more verbose Boolean circuits, it is highly inefficient to implement such Boolean circuits in practice.
There also has been substantial work [13], [38], [39], [51], [52] to extend the coverage of verifiable computing to a more generalized form of computations. Essentially, they developed a ''compiler'' that translates C-like programs (with e.g., memory accesses and control flows) into corresponding arithmetic circuits (or algebraic constraints). However, their approaches often do not efficiently scale, due to the blowup in the size of generated circuits. On the other hand, [34], [35] presented an encoding of rational numbers in a finite field, but still did not support rounding, suffering from the same problem (i.e., the exponential blowup of the field size) with the integer scaling method described in Section VII.

D. TOWARD VERIFIABLE AI
1) MOTIVATION
This paper is motivated by our vision about verifiable AI computation. Specifically, consider a deep neural network (DNN) training task: it is a computation that takes a set of samples, and produces an output model represented as one or more matrices. The computation often takes hours or even days. Should the training set be poisoned or the training machine(s) be compromised, the output model would have potentially devastating hidden behaviors [53]. Unlike programming bugs or malicious code, compromised AI models are extremely difficult to detect, because the models are nothing but some matrices. However, if verifiable AI computation is achieved, we will be able to trust a model by only trusting the fundamental mathematics, not any other factors such as human operators, program, or platform doing the training.
AI computations are many orders of magnitude heavier and involve more challenging operations than the aforementioned primitives in the VC literature, so it could be a long journey to fully realize the vision. Specifically, DNN training processes mainly consist of an overwhelmingly large amount of matrix multiplication and a relatively small amount of various non-linear functions such as ReLU, max-pooling, and softmax, where all the operations are performed using approximate arithmetic such as fixed-point or floating-point arithmetic. Here the following fundamental operations are required for DNN training computations: (1) approximate arithmetic, (2) comparison (for ReLU and max-pooling), and (3) natural exponentiation $e^x$ (for softmax and sigmoid). While the comparison operation was shown to be feasible in [6], and $e^x$ can be approximated as a (piecewise) polynomial (footnote vi), approximate arithmetic is not supported in existing VC schemes (e.g., as acknowledged in [6] and [34]), to the best of our knowledge (footnote vii).

2) APPLICATION TO VERIFIABLE AI
We believe that this work is an important step toward the vision of verifiable AI computations. Specifically, DNN training iterates the forward and backward passes over the sequence of layers, where each layer computation (in both forward and backward passes) consists of matrix multiplication and nonlinear function application on approximate arithmetic. Without the ability to round, the number of digits of the computation results keeps increasing and eventually exceeds the limit. Thus the existing VC approaches are not applicable in the AI space. Our approach establishes the theoretical feasibility of these computations. In addition, it also sheds light on the real-world performance: as shown in Section VII, matrix multiplication on fixed-point arithmetic can be efficiently supported by our scheme.
Among the nonlinear functions, the ReLU and max-pooling functions can be represented in an (approximate) arithmetic circuit by using the comparison operation [6], [12]. The sigmoid and tanh functions were shown to be effectively approximated by a polynomial [55] with sufficient accuracy, and such a polynomial can be efficiently represented in a circuit by using our optimal circuit construction. The softmax function requires computing the natural exponentiation function $e^x$, which can also be approximated by a polynomial for x ≤ 0, using input normalization [54], as mentioned in Section I.
Moreover, multiple iterations can be ''squashed'' [21] into a wide and shallow circuit by laying identical subcircuits of a single iteration side by side. This squashing can drastically reduce the depth of a circuit, which can significantly improve the protocol's performance [15], [21] at the cost of communication overheads. Finally, the protocol performance can be further improved by using hardware accelerators such as GPUs [15], [25] and ASICs [21], [24].

3) OTHER APPLICATIONS OF INDIVIDUAL RESULTS
As mentioned in the introduction, the individual technical results that we developed for the verifiable rounding operation have their own applications as well. First, our generalized GKR protocol can be used in other settings where rounding is not necessarily involved. For example, the ring $\mathbb{Z}_{p^e}$ has the nice property that addition and multiplication on $\mathbb{Z}_{p^e}$ are equivalent to those of e-bit machine integer arithmetic when p = 2, including the ''wrapping-around'' behavior in case of overflow (e.g., ''4 + 4 ≡ 0'' in both $\mathbb{Z}_{2^3}$ and 3-bit (unsigned) machine integer arithmetic). Thanks to this property, for certain computations that inherently require modular arithmetic (e.g., ones in cryptography implementations), one can construct arithmetic circuits of such computations at no extra cost (footnote viii). Note that to admit such computations with the original GKR protocol, one needs to additionally develop a circuit representation of the modulo reduction, i.e., $x \mapsto x \bmod 2^e$, which incurs additional overheads in protocol performance due to the circuit size blowup.
vi In particular, it is a well-known practice to use input normalization for $e^x$ [54] to avoid overflow when computing softmax or sigmoid, in which case x ≤ 0, and thus a (piecewise) polynomial approximation of $e^x$ for x ≤ 0 can provide sufficient precision since $e^x$ converges in the negative domain.
vii Although the technique used for comparison [6] has the potential to be used for rounding, that approach inherently introduces a significant overhead which is not involved in our approach. See Section I-C for more details.
On the other hand, our optimal circuit construction is applicable to the original GKR protocol (and its variants) as well, since it is not specific to the underlying algebraic structure. That is, when a given computation involves evaluation of certain polynomials, our circuit construction scheme can be used to optimize the protocol performance.
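The wrap-around property of $\mathbb{Z}_{2^e}$ noted earlier in this subsection can be checked exhaustively for a small word size. In this sketch, e-bit unsigned machine arithmetic is modeled by masking to the low e bits; the helper names are hypothetical.

```python
E = 3
MOD = 1 << E              # modulus of the ring Z_{2^3}
MASK = MOD - 1            # e-bit unsigned arithmetic keeps only the low e bits

def ring_add(a, b):
    return (a + b) % MOD

def ring_mul(a, b):
    return (a * b) % MOD

def machine_add(a, b):
    return (a + b) & MASK

def machine_mul(a, b):
    return (a * b) & MASK

# The two models agree on every pair of 3-bit operands
agree = all(ring_add(a, b) == machine_add(a, b) and ring_mul(a, b) == machine_mul(a, b)
            for a in range(MOD) for b in range(MOD))
print(agree, ring_add(4, 4))
```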

II. PRELIMINARIES
In this section, we review a number of basic concepts about verifiable computing (VC). In a VC scenario, a customer delegates a computation to an untrusted platform, and wants to be assured that the computational result is correct. The untrusted platform is called the Prover P, and the customer is called the Verifier V. As a motivating scenario of our work, V submits a DNN training job to P, which may take days to run. The goal of VC is to give V the ability to quickly verify the correctness of the computational result provided by P, without re-running the same job by V itself.
Next, we delve a little deeper into the type of VC we discuss in this paper, the interactive proof protocol.

A. INTERACTIVE PROOF PROTOCOL
We start with a definition of the interactive proof protocol for a function f as follows.
Definition 1 (Interactive Proof Protocol for f [9], [14]): Consider a prover P and a verifier V who wishes to compute a function f : X → Y. For an input x ∈ X chosen by V, P gives the claimed output y to V. Then, they exchange a sequence of messages and V accepts or rejects. The protocol is complete if V accepts whenever y = f(x) and P follows the protocol, and it is sound if, whenever y ≠ f(x), Pr[V accepts] < δ for any prover strategy. We will call δ the soundness probability bound. If P and V exchange r messages in total, we say the protocol has $\lceil r/2 \rceil$ rounds.
viii In this case, the optimization via a Galois ring (Section VI) is needed to secure a sampling set that is large enough for the protocol soundness. Moreover, the same technique is applicable to a more general ring $\mathbb{Z}_n$ for an arbitrary integer n by using the Chinese remainder theorem, i.e., reducing operations on $\mathbb{Z}_n$ to those on $\prod_i \mathbb{Z}_{p_i^{e_i}}$, where $\prod_i p_i^{e_i}$ is the prime factorization of n. Note that modular arithmetic on $\mathbb{Z}_n$ is commonly used in, e.g., lattice-based cryptography [56].
The GKR protocol is an interactive proof protocol for the evaluation of a layered arithmetic circuit over a finite field F where the circuit is composed of addition and multiplication gates (over F) of fan-in 2.

B. SCHWARTZ-ZIPPEL LEMMA
We first recall the Schwartz-Zippel lemma, the Sum-Check protocol, and the Multilinear Extension, which constitute the GKR protocol.
Lemma 1 (Schwartz-Zippel Lemma): Let F be a field, and $f : F^{\nu} \to F$ be a ν-variate nonzero polynomial of total degree (the sum of degrees of each variable) D. Then on any finite set $A \subseteq F$, $\Pr_{x \in A^{\nu}}[f(x) = 0] \le D/|A|$.
Note that the lemma implies that two different polynomials can coincide on only a tiny fraction of points. This contributes to the soundness of the following sum-check protocol and thus of the GKR protocol.
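The standard Schwartz-Zippel bound, which states that a nonzero polynomial of total degree D vanishes on at most a D/|A| fraction of the points of $A^{\nu}$, can be checked by brute force on a small field. The field $\mathbb{F}_7$ and the polynomial xy − 1 below are illustrative choices made here, not examples from the paper.

```python
from itertools import product

def count_zeros(poly, modulus, sample_set, nvars):
    """Count points of sample_set^nvars where poly vanishes modulo `modulus`."""
    points = list(product(sample_set, repeat=nvars))
    zeros = sum(1 for pt in points if poly(*pt) % modulus == 0)
    return zeros, len(points)

def f(x, y):
    return x * y - 1              # nonzero over F_7, total degree D = 2

p, D = 7, 2
zeros, total = count_zeros(f, p, range(p), 2)
# Schwartz-Zippel: zeros / total <= D / |A| with A = F_7
print(zeros, total, zeros * p <= D * total)
```

Here the polynomial vanishes exactly when y is the inverse of x, so the zero count stays well below the bound.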

C. SUM-CHECK PROTOCOL
The Sum-Check protocol [58] is an interactive proof protocol for a specific summation function S(f ) over a field F as follows.
Theorem 2: (Sum-Check Protocol [58]) Let F be a finite field. Let $f : F^{\nu} \to F$ be a ν-variate polynomial of degree at most d < |F| in each variable. The Sum-Check protocol is an interactive proof protocol with soundness $\nu d/|F|$ for the function: $S(f) = \sum_{b_1, \ldots, b_\nu \in \{0,1\}} f(b_1, \ldots, b_\nu)$. Protocol description: the protocol proceeds in ν rounds. In the first round, P sends the value S(f) and a polynomial $f_1(t) := \sum_{b_2, \ldots, b_\nu \in \{0,1\}} f(t, b_2, \ldots, b_\nu)$. V checks if $f_1(0) + f_1(1) = S(f)$, and rejects otherwise. In the i-th (2 ≤ i ≤ ν) round, V chooses $r_{i-1}$ randomly from F, and sends it to P. In response, P sends a polynomial $f_i(t) := \sum_{b_{i+1}, \ldots, b_\nu \in \{0,1\}} f(r_1, \ldots, r_{i-1}, t, b_{i+1}, \ldots, b_\nu)$. V checks if $f_{i-1}(r_{i-1}) = f_i(0) + f_i(1)$, and rejects otherwise. After the final ν-th round, V accepts if $f_\nu(r_\nu) = f(r_1, r_2, \ldots, r_\nu)$ for a random element $r_\nu \in F$, and rejects otherwise.
Proof: For the soundness condition, see Appendix B-A. Note that the soundness of the sum-check protocol is based on the Schwartz-Zippel lemma (Lemma 1); if a dishonest P sends a wrong value $S'(f) \neq S(f)$, he must send a polynomial $f'_1(t) \neq f_1(t)$, resulting in $f'_1(r_1) \neq f_1(r_1)$ with high probability. Repeatedly, the lemma forces P to send polynomials $f'_i(t) \neq f_i(t)$ and to be rejected (at the final round) with high probability.
Note that the sum-check protocol enables V to reduce the verification task on S(f) to that on the evaluation of f at one random point. This is the core functionality of sum-check in the GKR protocol.
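The round structure of the sum-check protocol can be simulated end to end on a toy instance. The sketch below makes several assumptions for illustration: the prime `P`, the example polynomial `f`, the encoding of each prover message as DEG + 1 evaluations (interpolated by the verifier), and the single cheating strategy modeled by `cheat=True`.

```python
import itertools
import random

P = 2**31 - 1          # prime field modulus (illustrative choice)
NU, DEG = 3, 2         # number of variables, degree bound per variable

def f(x):
    """Example 3-variate polynomial of degree at most 2 in each variable."""
    x1, x2, x3 = x
    return (x1 * x2 * x2 + 3 * x2 * x3 + x1 + 5) % P

def partial_sum(prefix, t):
    """f_i(t): first variables fixed to `prefix`, next one to t, rest summed over {0,1}."""
    rest = NU - len(prefix) - 1
    return sum(f(tuple(prefix) + (t,) + b)
               for b in itertools.product((0, 1), repeat=rest)) % P

def lagrange_eval(values, r):
    """Evaluate at r the polynomial of degree < len(values) taking values[j] at point j."""
    total = 0
    for j, vj in enumerate(values):
        num, den = 1, 1
        for m in range(len(values)):
            if m != j:
                num = num * (r - m) % P
                den = den * (j - m) % P
        total = (total + vj * num * pow(den, -1, P)) % P
    return total

def sum_check(claimed_sum, cheat=False, rng=random):
    """Return True if the verifier accepts the claimed value of S(f)."""
    prefix, expected = [], claimed_sum % P
    for i in range(NU):
        msg = [partial_sum(prefix, t) for t in range(DEG + 1)]   # honest f_i
        if cheat and i == 0:
            msg[0] = (expected - msg[1]) % P     # forged f_1 that passes the first check
        if (msg[0] + msg[1]) % P != expected:    # verifier's consistency check
            return False
        r = rng.randrange(P)                     # verifier's random challenge
        expected = lagrange_eval(msg, r)
        prefix.append(r)
    return expected == f(tuple(prefix))          # final check at one random point

true_sum = sum(f(b) for b in itertools.product((0, 1), repeat=NU)) % P
accepted = sum_check(true_sum)
```

In the cheating run, the forged first message differs from the honest polynomial, so the later honest messages fail the consistency check unless the challenge happens to hit one of at most DEG roots of the difference.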

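D. MULTILINEAR EXTENSION
Lemma 2 (Multilinear Extension [9]): Given a function $V : \{0,1\}^{\nu} \to F$, there exists a unique multilinear polynomial $\tilde{V} : F^{\nu} \to F$ extending V, i.e., $\tilde{V}(x) = V(x)$ for all $x \in \{0,1\}^{\nu}$. We call $\tilde{V}$ the multilinear extension of V.
The existence of the multilinear extension $\tilde{V}$ is guaranteed by the following construction: $\tilde{V}(x_1, \ldots, x_\nu) = \sum_{b \in \{0,1\}^{\nu}} V(b) \prod_{i=1}^{\nu} \left( x_i b_i + (1 - x_i)(1 - b_i) \right)$.
The uniqueness of the multilinear extension is also straightforward; see Appendix B-B. In the GKR protocol, the outputs of each layer in the circuit give rise to a unique multilinear extension.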
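A multilinear extension can be evaluated at an arbitrary point by folding the table of values one variable at a time. This is a minimal sketch under stated assumptions: the prime 97, the function name `mle_eval`, and the convention that the first variable is the most significant bit of the table index are all choices made here.

```python
P = 97   # small prime field, for illustration only

def mle_eval(values, point):
    """Evaluate the multilinear extension of V at `point` in F_P^nu.

    values[b] holds V(b_1, ..., b_nu), where b_1 is the most significant bit
    of the index b; len(values) must equal 2 ** len(point).
    """
    table = [v % P for v in values]
    for x in point:                       # fold away one variable at a time
        half = len(table) // 2
        # Linear interpolation between the b_i = 0 half and the b_i = 1 half
        table = [((1 - x) * table[j] + x * table[half + j]) % P
                 for j in range(half)]
    return table[0]

values = [3, 1, 4, 1, 5, 9, 2, 6]         # V on {0,1}^3
on_cube = [mle_eval(values, [(b >> 2) & 1, (b >> 1) & 1, b & 1]) for b in range(8)]
off_cube = mle_eval(values, [2, 0, 0])    # a point outside the Boolean cube
```

On Boolean points the extension reproduces V, and along each coordinate it is a degree-1 interpolation between the two neighboring table entries.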

E. GKR PROTOCOL
Now we describe the GKR protocol (footnote x), which is an interactive proof protocol for the evaluation of a layered arithmetic circuit over a finite field F. We only give an overview of the protocol; a detailed description can be found in [4], [15] or in Section III-B.

1) OVERVIEW
Assume we are given a layered arithmetic circuit (over F) with n inputs, depth d, size S, and fan-in 2. Each layer is composed of gates outputting the sum or the product of two inputs. The layers are numbered so that the output layer is 0, the input layer is d, and the gates of the i-th layer take as input the outputs of gates in the (i+1)-th layer. Let $S_i$ denote the size of the i-th layer, and assume for simplicity that it is a power of 2, i.e., $S_i = 2^{s_i}$. We can number each gate of the i-th layer with a binary string in $\{0,1\}^{s_i}$, which defines a function $V_i : \{0,1\}^{s_i} \to F$ mapping the given binary string to the output of the corresponding gate. Let $\tilde{V}_i$ be the multilinear extension (MLE) of $V_i$; then there exists an interesting relation between the MLEs of adjacent layers, as follows [59]: $\tilde{V}_i(z) = \sum_{(\omega_1, \omega_2) \in \{0,1\}^{2 s_{i+1}}} \left[ \widetilde{add}_i(z, \omega_1, \omega_2) \left( \tilde{V}_{i+1}(\omega_1) + \tilde{V}_{i+1}(\omega_2) \right) + \widetilde{mult}_i(z, \omega_1, \omega_2) \, \tilde{V}_{i+1}(\omega_1) \tilde{V}_{i+1}(\omega_2) \right]$ (we omit the vector notation, e.g., $\vec{z}$), where $\widetilde{add}_i$ (or $\widetilde{mult}_i$) is the MLE of a function which is 1 only if the input binary strings indicate an addition (or multiplication) gate and its corresponding two gates providing inputs, and 0 otherwise. We call $\widetilde{add}_i$ (and $\widetilde{mult}_i$) wiring predicates [9]. Now, the GKR protocol proceeds layer by layer, starting from the output layer. V, having an output of the circuit, gets a claim $\tilde{V}_0(z_0) = v_0$ by evaluating $\tilde{V}_0$ on a random point $z_0$. Then, she reduces the claim to $\tilde{V}_1(z_1) = v_1$ by executing the sum-check protocol on the relation of MLEs described above. Continuing this process layer by layer, she finally gets a claim that $\tilde{V}_d(z_d) = v_d$, and checks whether it is correct by evaluating $\tilde{V}_d$, which is defined by her inputs.
x In particular, we describe the simplest form [59] by Thaler.
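Restricted to Boolean inputs, the relation between adjacent layers simply says that each gate value is a sum over all pairs of gates in the layer below, weighted by the wiring predicates. The toy layer below is an assumption made for illustration (gate indices stand in for binary strings, and the field size and gate values are arbitrary); it does not evaluate the multilinear extensions themselves.

```python
P = 101                                   # small prime field, illustrative
V_next = [3, 5, 7, 11]                    # gate outputs of layer i+1
# Gates of layer i as (kind, left input index, right input index)
gates = [("add", 0, 1), ("mult", 2, 3)]

def add_pred(z, w1, w2):
    """1 iff gate z of layer i is an addition gate reading gates w1 and w2."""
    return 1 if gates[z] == ("add", w1, w2) else 0

def mult_pred(z, w1, w2):
    """1 iff gate z of layer i is a multiplication gate reading gates w1 and w2."""
    return 1 if gates[z] == ("mult", w1, w2) else 0

def layer_value(z):
    """V_i(z) written as a sum over all pairs (w1, w2) of layer-(i+1) gates."""
    total = 0
    n = len(V_next)
    for w1 in range(n):
        for w2 in range(n):
            total += add_pred(z, w1, w2) * (V_next[w1] + V_next[w2])
            total += mult_pred(z, w1, w2) * V_next[w1] * V_next[w2]
    return total % P

V_layer = [layer_value(z) for z in range(len(gates))]
print(V_layer)
```

Writing the layer as one big sum is what makes it amenable to the sum-check protocol: the verifier never enumerates the pairs, only checks the sum at random points.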

2) COMPUTATIONAL COMPLEXITY AND SOUNDNESS
A number of refinements [6], [9], [14], [16], [21] of the GKR protocol have reduced its computational cost. As a result, the costs of P and V in the number of operations over $\mathbb{F}$, and the communication cost C in the number of elements of $\mathbb{F}$, are as follows [16]: xi
$$P: O(S), \qquad V: O(n + d \log S), \qquad C: O(d \log S) \qquad (2)$$
where $S$ and $d$ denote the size and depth of the circuit, respectively, and $n$ denotes the number of inputs. The soundness probability $\lambda_s$ is bounded by $(7d \log S + \log n)/|\mathbb{F}|$. We note that the prover's cost can be broken down into the circuit evaluation cost and the proof generation cost, and we will show certain circuits for which the proof generation cost is smaller than the circuit evaluation cost (Section V).

F. NOTATION AND COST MODEL
In this paper, $\mathbb{Z}$, $\mathbb{Z}_N$, and $\mathbb{F}$ denote the ring of integers, the ring of integers modulo a positive integer $N$, and a finite field, respectively. Also, all logarithms are base 2. The (computational) cost of P or V is measured by the number of arithmetic operations over the corresponding domain, such as $\mathbb{F}$ or $\mathbb{Z}_N$. The communication cost is measured by the number of elements of the corresponding domain.

III. GENERALIZATION OF GKR PROTOCOL OVER A RING
In this section, we show that the GKR protocol can be applied to an arithmetic circuit over a ring, a more general algebraic structure than a field. Throughout this paper, a ring $R$ means a finite commutative ring with multiplicative identity 1. It is similar to a field in that it has two operations, addition and multiplication (which distributes over addition), an additive identity 0, and a multiplicative identity 1. It also has an additive inverse for every element but, in contrast to a field, not necessarily a multiplicative inverse. A zero divisor of a ring $R$ is an element $x \in R$ that divides 0, i.e., there exists a nonzero element $y \in R$ such that $xy = 0$. An integral domain is a ring that has no zero divisors other than 0. Typical examples of rings are $\mathbb{Z}$ (integers) and $\mathbb{Z}_N$ (integers modulo $N$). Note that $\mathbb{Z}$ is an integral domain, and $\mathbb{Z}_N$ is a field if $N$ is a prime, but is not even an integral domain otherwise.
xi We assume that the wiring predicates of the circuit are efficiently computable by V [9]. If not, the cost can be amortized by batching [6] or data-parallel computations [15], [21].

A. SCHWARTZ-ZIPPEL LEMMA AND SUM-CHECK PROTOCOL OVER A RING
Since the original GKR protocol is based on the Schwartz-Zippel lemma (Lemma 1), the lemma is also the starting point of the generalization. Here we use a more general form given by Bishnoi et al. [19], as follows.
Lemma 3 (Generalized Schwartz-Zippel [19]): Let $R$ be a ring, and $f : R^n \to R$ be an $n$-variate nonzero polynomial of total degree (the sum of degrees of each variable) $D$ over $R$.
We will call $A$ a sampling set.
Proof: Appendix B-C.
This lemma guarantees that the identity check of a polynomial over $R$ can be done as in a field, provided that we sample the random points from a sampling set $A \subseteq R$.
Example 1: Let $R = \mathbb{Z}_{p^e}$ for an odd prime $p$, and
Proof: Appendix B-D.
Note that the soundness probability is $nd/|A|$, in contrast to $nd/|\mathbb{F}|$ in Theorem 2.
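To illustrate why random points must come from a sampling set rather than from the whole ring, here is a small Python sketch over $\mathbb{Z}_{p^e}$. It assumes that the sampling-set condition is that differences of distinct elements are not zero divisors (the reading suggested by footnote xvi), and it uses the hypothetical choices $p = 5$, $e = 3$, $A = \{0, \ldots, p-1\}$, and $f(x) = p^{e-1}x$.

```python
from math import gcd

p, e = 5, 3
N = p**e  # modulus of the ring Z_{p^e}

def f(x):
    """A nonzero degree-1 polynomial over Z_{p^e}: f(x) = p^(e-1) * x."""
    return (p**(e - 1) * x) % N

# Over the whole ring, every multiple of p is a root of f.
roots_in_ring = [x for x in range(N) if f(x) == 0]

# Candidate sampling set: differences of distinct elements are units mod p^e.
A = list(range(p))
assert all(gcd(x - y, N) == 1 for x in A for y in A if x != y)
roots_in_A = [x for x in A if f(x) == 0]

print(len(roots_in_ring), "roots among", N, "ring elements")
print(len(roots_in_A), "root among", len(A), "sampling-set elements")
```

A field-style bound of $D/|R|$ would fail badly here, since a degree-1 polynomial has far more than one root in the ring, whereas inside $A$ the number of roots stays within the bound $D/|A|$.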

Remark 1 (Additional Condition for Efficient Specification of f i (t)):
In the i-th round of the Sum-Check protocol, (honest) P should provide

B. GKR PROTOCOL OVER A RING
Now we present a generalized GKR protocol over R. We can see that the original GKR protocol can be applied to an arithmetic circuit over R by restricting random points required in the protocol to the sampling set A of Lemma 3. Below we clarify and validate the modification made in each step of the protocol.

1) MULTILINEAR EXTENSION AND INITIAL STEP
We first need to ensure the existence and uniqueness (Lemma 2) of the multilinear extension (MLE) $\tilde{V} : R^{\mu} \to R$ of a function $V : \{0,1\}^{\mu} \to R$. This follows from the fact that the proof of Lemma 2 remains valid in $R$, since it exploits only properties (i.e., commutativity and distributivity of addition and multiplication, and the existence of the multiplicative identity 1) that hold in $R$ as well. At the initial step, V reduces the task of checking output values to that of checking $\tilde{V}_0(z_0) = v_0$, where $\tilde{V}_0$ is the MLE of the output values. In the original protocol, the reduction is valid by Lemma 1. In the generalized protocol, the reduction is valid by Lemma 3, provided that V samples the random point $z_0$ from the set $A$ of Lemma 3.

2) APPLYING SUM-CHECK PROTOCOL
We have already shown that the Sum-Check protocol is valid in $R$ as well, by Theorem 3. Therefore, reducing the task of checking $\tilde{V}_i(z_i) = v_i$ to that of checking both $\tilde{V}_{i+1}(\omega_1^*) = v_{i+1,1}$ and $\tilde{V}_{i+1}(\omega_2^*) = v_{i+1,2}$ can be done using the generalized Sum-Check protocol. Note that V samples each random point from the set $A$ in the generalized Sum-Check protocol.

4) COMPLEXITY AND SOUNDNESS
Note that the computational cost of the generalized protocol is the same as that of the original protocol (Equation 2), except that the cost is measured in the number of operations or elements of $R$ instead of $\mathbb{F}$. The cost reduction techniques [6], [9], [14], [16] proposed in refinements of the GKR protocol are also applicable if $R$ satisfies the additional condition introduced in Remark 1.
xii If $x \in \mathbb{Z}_{p^e}$ is not a multiple of $p$, then $\gcd(x, p^e) = \gcd(x, p) = 1$, and $ax + bp^e = 1$ for some $a, b \in \mathbb{Z}$, i.e., $a \pmod{p^e} \in \mathbb{Z}_{p^e}$ is a multiplicative inverse of $x$.
VOLUME 10, 2022
Soundness of the generalized GKR protocol follows from that of the generalized Sum-Check protocol. Hence, it has the same soundness as the original one, except that $|\mathbb{F}|$ is replaced by $|A|$ (see Theorem 4 below).
Theorem 4 (GKR Protocol Over R): Let $R$ be a finite ring, and $C : R^n \to R$ be an arithmetic circuit of size $S$, depth $d$, with $n$ inputs over $R$. Let $A$ be the sampling set of $R$ in Lemma 3. The generalized GKR protocol is an interactive proof protocol for the evaluation of $C$ with soundness $(7d \log S + \log n)/|A|$. The cost of the generalized GKR protocol is the same as that of the original protocol described in Section II-E, if $A$ satisfies the stronger condition of Remark 1.

IV. VERIFIABLE ROUNDING OPERATION
In this section, we explain how to support the rounding operation on top of the generalized GKR protocol described in Section III. As explained in Section I, we consider an approximate arithmetic circuit over a ring $\mathbb{Z}_{p^e}$ (i.e., integers in the base-$p$ system), where $p$ is a prime and $e > 1$, and a rounding gate that performs the (floor) rounding $x \mapsto \lfloor x/p \rfloor$. xiii Like closely related previous work [4], [9], [14], we assume that the given circuit is layered. For simplicity of presentation, we also assume that the given circuit is structured to have rounding layers, each of which consists solely of rounding gates, while the other layers have only addition and multiplication gates. xiv The idea is to replace each rounding gate with a combination of plain arithmetic gates, and use our generalized GKR protocol over $\mathbb{Z}_{p^e}$. Specifically, we employ a low-degree polynomial $ldr(x)$ such that $\lfloor x/p \rfloor = ldr(x)/p$, where $ldr(x)$ can be represented as a circuit over addition and multiplication gates. (Later, in Section V, we will provide an optimal circuit construction for arbitrary polynomials including $ldr(x)$.) Then, the rounding gate can be replaced with the circuit of $ldr(x)$ followed by a division-by-$p$ gate, $x \mapsto x/p$. Below we explain what the polynomial $ldr(x)$ is, and how to verify the division-by-$p$ gate in our generalized GKR protocol.
A. LOWEST-DIGIT-REMOVAL POLYNOMIAL OVER $\mathbb{Z}_{p^e}$
Chen and Han [17] recently showed the existence of a polynomial over $\mathbb{Z}_{p^e}$ that sets the input's lowest digit to zero. They also provided an exact construction of such a polynomial.
xiii As mentioned earlier, the proper rounding, $\lfloor x/p \rceil$, can be represented using the floor rounding, i.e., $\lfloor x/p \rceil = \lfloor (x + \frac{p-1}{2})/p \rfloor$.
xiv An arbitrary circuit can be adjusted to satisfy this assumption by adding dummy gates (i.e., a multiplication-by-$p$ gate followed by a rounding gate) for each non-rounding gate.
Note that the degree of $ldr(x)$ is small: roughly logarithmic in the size of $\mathbb{Z}_{p^e}$. This gives us an efficient representation of rounding as a combination of additions and multiplications.
Example 3 [17]: For $e = 2$, we have:
As mentioned earlier, the rounding operation ($t \mapsto \lfloor t/p \rfloor$) can be represented as $t \mapsto ldr(t)/p$. The problem is that division is in general not admitted in an arithmetic circuit over a ring (and thus not in the generalized GKR protocol over a ring). However, in $ldr(x)/p$ the division is always well-defined, since the result of $ldr(x)$ is guaranteed to be a multiple of $p$, where $p$ is a constant. Also, as mentioned earlier, the given circuit is assumed to have separate rounding layers that consist solely of rounding gates. Thus, the reduced circuit has a separate division-by-$p$ layer that consists solely of division-by-$p$ gates, and we have the following equation:
$$\tilde{V}_{i+1}(z) = p \cdot \tilde{V}_i(z) \pmod{p^e},$$
where $\tilde{V}_i$ (and $\tilde{V}_{i+1}$) denotes the MLE of the outputs (and inputs, resp.) of the division-by-$p$ layer. Now, in the generalized GKR protocol, the verifier verifies the outputs of the division-by-$p$ layer by reducing the task of verifying $\tilde{V}_i(r) = v$ to the task of verifying $\tilde{V}_{i+1}(r) = pv$. This reduction enjoys perfect soundness: writing $\tilde{V}'$ for the values claimed by a cheating prover, for $\tilde{V}'_i(r) \neq \tilde{V}_i(r)$ we have $\tilde{V}'_{i+1}(r) = p\tilde{V}'_i(r) \neq p\tilde{V}_i(r) = \tilde{V}_{i+1}(r) \pmod{p^e}$.

Remark 2 (Modulus Change at Division-by-p Layer):
Note that the codomain of $\tilde{V}_i$ is $\mathbb{Z}_{p^{e-1}}$, while the codomain of $\tilde{V}_{i+1}$ is $\mathbb{Z}_{p^e}$. That is, the outputs of each rounding layer should be regarded as elements of $\mathbb{Z}_{p^{e-1}}$, while the inputs are elements of $\mathbb{Z}_{p^e}$. This is because $t = ap + b \in \mathbb{Z}_{p^e}$ represents $(ap + b) + np^e \in \mathbb{Z}$ for some $n \in \mathbb{Z}$, where $0 \le b < p$, while $\lfloor t/p \rfloor = a + np^{e-1} \in \mathbb{Z}$ is represented by $a \in \mathbb{Z}_{p^{e-1}}$.
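The interplay between lowest-digit removal, the division-by-$p$ gate, and the modulus change can be checked exhaustively for small parameters. The sketch below is illustrative only: `ldr` is computed directly as `x - (x mod p)` as a stand-in for the polynomial of [17] (whose coefficients are not reproduced here), and the parameters $p = 7$, $e = 3$ and all function names are hypothetical.

```python
p, e = 7, 3

def ldr(x):
    """Stand-in for lowest-digit removal: clears the lowest base-p digit.
    Computed directly here; the paper realizes it as a polynomial."""
    return (x - x % p) % p**e

def rounding_gate(x):
    """floor(x/p) for x in Z_{p^e}; the result lives in Z_{p^(e-1)}."""
    y = ldr(x)
    assert y % p == 0  # division by p is well-defined on multiples of p
    return (y // p) % p**(e - 1)

def reduce_claim(v):
    """Verifier's reduction: a claim V_i(r) = v over Z_{p^(e-1)}
    becomes the claim V_{i+1}(r) = p * v over Z_{p^e}."""
    return (p * v) % p**e

# The gate agrees with floor rounding on every element of Z_{p^e}.
print(all(rounding_gate(x) == x // p for x in range(p**e)))

# Multiplication by p is injective from Z_{p^(e-1)} into Z_{p^e},
# so distinct output claims never collapse to the same input claim.
images = {reduce_claim(v) for v in range(p**(e - 1))}
print(len(images) == p**(e - 1))
```

The injectivity check mirrors the perfect-soundness argument for the division-by-$p$ layer: a wrong output value always translates into a wrong input value.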

V. OPTIMAL CIRCUIT CONSTRUCTION FOR POLYNOMIAL EVALUATION
In this section, we present a novel, optimal circuit construction for proof and verification of polynomial evaluation with the GKR protocol. The circuit has an optimal depth, and is regular so that a prover (and a verifier) enjoys an optimal cost (and high efficiency) when proving (and verifying) the circuit via the GKR protocol. It has an additional advantage when applied to the parallel evaluation of the same polynomial on multiple inputs, in which case, once a prover has evaluated the circuit, the proof generation cost becomes sublinear in the size of the circuit (i.e., the proof generation is much faster than even the circuit evaluation!), which is better than the previously best known results [16], [21].

A. OVERVIEW OF OUR CIRCUIT CONSTRUCTION
Our circuit construction is inspired by the Paterson-Stockmeyer algorithm [18], which evaluates a polynomial $g(t)$ of degree $N$ using $O(\sqrt{N})$ non-constant multiplications. xv Specifically, for a given polynomial $g(t) = \sum_{i=0}^{N} a_i t^i$, our circuit is constructed to first compute $g_k(t) = \sum_{j=1}^{\sqrt{N}} a_{j+\sqrt{N}(k-1)} t^j$ for $k = 1, \ldots, \sqrt{N}$, and then compute $a_0 + \sum_{k=1}^{\sqrt{N}} g_k(t) \cdot t^{\sqrt{N}(k-1)}$, which gives $g(t)$. For example, for a polynomial $g(t) = a_0 + a_1 t + \cdots + a_{16} t^{16}$ of degree 16, the constructed circuit (as shown in Figure 1) computes the polynomial as follows:
$$a_0 + \big[(a_1 t + \cdots + a_4 t^4) + (a_5 t + \cdots + a_8 t^4) \cdot t^4\big] + \big[(a_9 t + \cdots + a_{12} t^4) + (a_{13} t + \cdots + a_{16} t^4) \cdot t^4\big] \cdot t^8$$
Here we note two properties of the above evaluation method that contribute to our optimal circuit construction. First, not all powers of $t$ are needed; in the example, only $t, t^2, t^3, t^4$, and $t^8$ are. In general, only $t, t^2, \ldots, t^{\sqrt{N}}, t^{2\sqrt{N}}, t^{4\sqrt{N}}, \ldots, t^{N/2}$ are needed to compute $g(t)$ in the above evaluation method. Also, every sub-polynomial $g_k$ is computed using the same small subset of powers of $t$, that is, $t, t^2, \ldots, t^{\sqrt{N}}$. These properties contribute to reducing the circuit size and increasing the circuit regularity.
Now we describe certain observations that led us to our circuit construction. The first observation is that the GKR protocol admits any efficiently computable gate with fan-in greater than 2 without affecting the asymptotic complexity of the protocol, as long as the fan-in is constant. Also, the GKR protocol can admit a layer that consists solely of linear-sum gates, $x \mapsto \sum_i a_i x_i$, at no cost overhead, by exploiting its nice evaluation structure, even if its fan-in is not constant (see Appendix A for more details). These observations give us more flexibility in constructing a circuit, and we utilize the linear-sum gate for the evaluation of the $g_k$'s, and the fused multiply-add gate, $(x, y, z) \mapsto xy + z$, for the summation of the $g_k$'s.
This yields a circuit of width $2\sqrt{N}$ and depth $(3 + \log N)$ with a regular wiring pattern. Figure 1 shows our circuit construction for a single polynomial $g(t)$. The circuit is composed of four parts. The first part, referred to as polygen, consists of $\log \sqrt{N}$ layers of multiplication gates; it takes $t$ as input and computes its powers $t, t^2, \ldots, t^{\sqrt{N}}$. The second part, referred to as eval, consists of a single layer of linear-sum gates and computes the sub-polynomials $g_k(t)$. The third part, referred to as unify, consists of $\log \sqrt{N}$ layers of fused multiply-add gates and computes the summation of the sub-polynomials, $g(t) - a_0$. Note that the unify part also computes the square-powers $t^{2\sqrt{N}}, t^{4\sqrt{N}}, t^{8\sqrt{N}}, \ldots, t^{N/2}$ alongside the main computation, where the same multiply-add gate is used together with dummy gates to achieve a regular wiring pattern. The last part, referred to as extract, consists of a single layer with a constant-addition gate and computes the final result $g(t)$. More details and a precise definition of our circuit construction are provided in Appendix A.
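The four parts described above can be mirrored by a short evaluation routine. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it evaluates the polynomial directly rather than building GKR layers, pads the coefficient list with zeros up to the next power of four, and the function names and the modulus $7^3$ in the usage lines are hypothetical.

```python
def eval_poly_circuit(coeffs, t, modulus):
    """Evaluate sum(coeffs[i] * t**i) mod modulus, mirroring the four parts
    polygen / eval / unify / extract."""
    n_deg = len(coeffs) - 1
    m = 1
    while m * m < n_deg:          # m = sqrt(N) for the padded degree N = m*m
        m *= 2
    a = list(coeffs) + [0] * (m * m + 1 - len(coeffs))

    # polygen: powers[j-1] = t^j for j = 1..m, built by repeated doubling.
    powers = [t % modulus]
    while len(powers) < m:
        top = powers[-1]
        powers += [(x * top) % modulus for x in powers]

    # eval: linear-sum gates, g[k] = sum_{j=1..m} a[j + m*k] * t^j.
    g = [sum(a[j + m * k] * powers[j - 1] for j in range(1, m + 1)) % modulus
         for k in range(m)]

    # unify: fused multiply-add gates (x, y, z) -> x*y + z, halving the width
    # while squaring the stride power t^m, t^(2m), ..., t^(N/2).
    stride = powers[-1]
    while len(g) > 1:
        g = [(g[i + 1] * stride + g[i]) % modulus for i in range(0, len(g), 2)]
        stride = (stride * stride) % modulus

    # extract: add the constant term.
    return (a[0] + g[0]) % modulus

def horner(coeffs, t, modulus):
    """Reference evaluation used only to cross-check the sketch."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * t + c) % modulus
    return acc

coeffs = list(range(1, 18))       # a_0..a_16, a degree-16 example
print(eval_poly_circuit(coeffs, 3, 7**3) == horner(coeffs, 3, 7**3))
```

Note that only the powers $t, \ldots, t^{\sqrt{N}}$ and the stride powers are ever formed, which is the property the construction relies on for its small width.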
xv For simplicity of presentation, let $N = 2^{2n}$ be the smallest power of four such that $N \ge \deg(g)$.
FIGURE 1. Our circuit construction for a polynomial $g(t)$ of degree 16, where $g_k(t) = \sum_{j=1}^{4} a_{j+4(k-1)} t^j$. The green arrow denotes the linear-sum gate wiring. The gates computing zero are dummy gates that are added to achieve a regular wiring pattern and thus admit an optimal prover and an efficient verifier. The presence of the dummy gates does not affect the asymptotic cost.
In case that multiple inputs need to be evaluated on the same polynomial, our circuit construction simply puts multiple copies of the same circuit shown in Figure 1 side-by-side. This yields a circuit that has a larger width O(M √ N ) but the same depth O(log N ), where M is the number of inputs.

B. COST ANALYSIS
Let us consider the case of multiple inputs being evaluated on the same polynomial. The following lemma shows the complexity of the GKR protocol (precisely, the variants [14] or [16, Section 3]) on our circuit construction for such a case. (The complexity for the single-input case is an instance of that of the single-polynomial-multiple-inputs case.)
Proof: Appendix B-E.
Here we note that our proof generation cost is better than the previously best known result. Specifically, let $C$ be the circuit described in Lemma 5, and $C'$ be a circuit that is equivalent to $C$ with the same size $O(MN)$ and the same depth $O(\log N)$, but is constructed in a standard way (i.e., computing all the powers of $t$ using the exponentiation-by-squaring method, computing all the monomials, and adding all the monomials in a binary tree fashion). Then, the proof generation costs of Giraffe [21] and Libra [16] on $C'$ are $O(MN + N \log N)$ and $O(MN)$, respectively, while ours is $O(M\sqrt{N} + N)$. Their other costs (i.e., circuit evaluation, verification, and communication) on $C'$ are the same as ours.

VI. COST OPTIMIZATION FOR ROUNDING OPERATION
In this section, we present an optimization technique that can significantly reduce the prover's cost for the rounding layers described in Section IV.

A. GALOIS RING OVER Z p e AND SAMPLING SET
A Galois ring $\mathbb{Z}_{p^e}[t]/(f(t))$ over $\mathbb{Z}_{p^e}$, for a monic irreducible polynomial $f(t) \in \mathbb{Z}_p[t]$, is a natural generalization of the Galois field $GF(p^n)$ over a finite field $\mathbb{F}_p$. The representation of elements and operations in $\mathbb{Z}_{p^e}[t]/(f(t))$ is similar to that of $GF(p^n)$, up to the difference between $\mathbb{Z}_{p^e}$ and $\mathbb{F}_p$. Let $d$ be the degree of $f(t)$. Then, the dimension of $\mathbb{Z}_{p^e}[t]/(f(t))$ is $d$, and each element is represented as a $d$-dimensional tuple in $\mathbb{Z}_{p^e}^d$ whose standard basis corresponds to $1, t, t^2, \ldots, t^{d-1}$. Thus, addition corresponds to component-wise addition in $\mathbb{Z}_{p^e}^d$, and multiplication by an element $a = (a_0, a_1, \ldots, a_{d-1})$ corresponds to multiplication by its corresponding matrix, according to the multiplication rule $t \cdot (a_0, a_1, \ldots, a_{d-1}) = (0, a_0, a_1, \ldots, a_{d-2}) - a_{d-1} \cdot (f_0, f_1, \ldots, f_{d-1})$, where $f_0, \ldots, f_{d-1}$ are the lower coefficients of $f(t) = t^d + f_{d-1} t^{d-1} + \cdots + f_0$.
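The multiplication rule above translates directly into code. The following Python sketch is a minimal illustration, assuming elements are stored as coefficient lists (lowest degree first) and $f$ is given by its lower coefficients $(f_0, \ldots, f_{d-1})$; the helper names `times_t` and `gr_mul` are hypothetical, and the ring $\mathbb{Z}_{271^{14}}[t]/(t^2+1)$ in the usage lines is chosen only as an example.

```python
def times_t(a, f_low, modulus):
    """Multiply a by t, using t^d = -(f_0 + f_1 t + ... + f_{d-1} t^{d-1})."""
    top = a[-1]                      # coefficient that overflows into t^d
    shifted = [0] + a[:-1]           # (0, a_0, ..., a_{d-2})
    return [(s - top * fi) % modulus for s, fi in zip(shifted, f_low)]

def gr_mul(a, b, f_low, modulus):
    """Product a * b in Z_modulus[t]/(f), accumulating a_i * (t^i * b)."""
    result = [0] * len(b)
    shifted_b = list(b)              # t^0 * b
    for ai in a:
        result = [(r + ai * s) % modulus for r, s in zip(result, shifted_b)]
        shifted_b = times_t(shifted_b, f_low, modulus)
    return result

modulus = 271**14
f_low = [1, 0]                       # f(t) = t^2 + 1
print(gr_mul([0, 1], [0, 1], f_low, modulus) == [modulus - 1, 0])   # t*t = -1
print(gr_mul([1, 2], [3, 4], f_low, modulus) == [modulus - 5, 10])  # -5 + 10t
```

Each product costs $O(d^2)$ base-ring multiplications in this schoolbook form, and a sparse $f$ keeps `times_t` cheap because most of the `top * fi` terms vanish.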

is a valid sampling set for the generalized Schwartz-Zippel lemma (Lemma 3) as well as the generalized GKR protocol (Theorem 4).
Note that the cardinality of the sampling set $A$ in Theorem 5 is $p^d \ge p$, which is maximal. xvi Moreover, $A$ satisfies the additional condition of Remark 1.

1) IRREDUCIBLE POLYNOMIALS IN $\mathbb{Z}_p[t]$
To construct a Galois ring $\mathbb{Z}_{p^e}[t]/(f(t))$, we need an irreducible polynomial in $\mathbb{Z}_p[t]$. Indeed, there exist many irreducible polynomials $f(t) \in \mathbb{Z}_p[t]$ of any degree $d$, and a sparse polynomial (one in which most coefficients are zero) is desirable for the efficiency of multiplication in $\mathbb{Z}_{p^e}[t]/(f(t))$.
xvi A set containing more than $p^d$ elements has distinct elements $x$ and $y$ such that $x - y = (n_0 p, n_1 p, \ldots, n_{d-1} p) \in \mathbb{Z}_{p^e}[t]/(f(t))$ by the pigeonhole principle, where the $n_i$'s are integers, and $(n_0 p, n_1 p, \ldots, n_{d-1} p)$ is a zero divisor.
We provide examples of such sparse irreducible polynomials (Lemma 6) in Appendix B-G.
Lemma 6: Let $p$ be a prime number. All of the following polynomials are irreducible in $\mathbb{Z}_p$: − a for some $a$ when $p \equiv 1 \bmod 3$. 5) $x^4 - 2$ when $p \equiv 5 \bmod 8$. 6) $x^4 - 3$ when $p \equiv 5 \bmod 12$.
Proof: Appendix B-G.
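For small primes, the surviving items of the lemma can be confirmed by brute force. The Python sketch below tries every monic candidate factor of degree at most half the degree of the polynomial; it is a check for tiny parameters only, and the function names are hypothetical. Coefficient lists are lowest degree first.

```python
import itertools

def poly_mod(num, den, p):
    """Remainder of num modulo a monic polynomial den over F_p."""
    num = [c % p for c in num]
    while len(num) >= len(den):
        lead = num[-1]
        if lead:
            shift = len(num) - len(den)
            for i, c in enumerate(den):
                num[shift + i] = (num[shift + i] - lead * c) % p
        num.pop()                    # leading coefficient is now zero
    return num

def is_irreducible(f, p):
    """True if monic f has no monic factor of degree 1..deg(f)//2 over F_p."""
    deg = len(f) - 1
    for k in range(1, deg // 2 + 1):
        for tail in itertools.product(range(p), repeat=k):
            g = list(tail) + [1]     # monic candidate factor of degree k
            if not any(poly_mod(f, g, p)):
                return False
    return True

print(is_irreducible([-3, 0, 0, 0, 1], 17))   # x^4 - 3, 17 = 5 mod 12
print(is_irreducible([-2, 0, 0, 0, 1], 5))    # x^4 - 2, 5 = 5 mod 8
print(is_irreducible([1, 0, 1], 271))         # x^2 + 1 modulo 271
```

The first and third calls correspond to the moduli $x^4 - 3$ over $p = 17$ and $x^2 + 1$ over $p = 271$ that appear in the experiments of Section VII.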

B. OPTIMIZATION OF PROVER'S COST FOR ROUNDING LAYERS
Now we explain how to optimize the prover's cost for the rounding layers. Let $C_p$ be a given approximate arithmetic circuit over $\mathbb{Z}_{p^e}$, and $q$ be a prime such that $p \approx q^d$. First, we convert $C_p$ to an approximately equivalent circuit $C_q$ over $\mathbb{Z}_{q^{de}}$ by the base-$p$-to-base-$q$ conversion, where each base-$p$ rounding gate ($x \mapsto \lfloor x/p \rfloor$) in $C_p$ is replaced with $d$ consecutive base-$q$ rounding gates ($x \mapsto \lfloor x/q \rfloor$) in $C_q$. Then, we apply the generalized GKR protocol over a Galois ring $\mathbb{Z}_{q^{de}}[t]/(f(t))$. Here, we employ the sampling set given in Theorem 5, whose cardinality is $q^d \approx p$, which affects the soundness. Moreover, in the process of the protocol, the circuit evaluation is performed over $\mathbb{Z}_{q^{de}}$, while the proof generation and the verification are conducted over $\mathbb{Z}_{q^{de}}[t]/(f(t))$. This is valid, since $\mathbb{Z}_{q^{de}}[t]/(f(t))$ naturally embeds $\mathbb{Z}_{q^{de}}$ as constant terms.
Now we analyze the complexity of the protocol for a rounding layer that consists of $r$ rounding gates. First, note that the degree of the rounding polynomial ($ldr$) of $C_p$ is $ep$, while that of $C_q$ is $deq \approx de\sqrt[d]{p}$, which is much smaller than $ep$ for some $d$. On the other hand, the cost of an individual addition (and multiplication) operation in $\mathbb{Z}_{q^{de}}[t]/(f(t))$ is $O(d)$ (and $O(d^2)$, resp.) times larger than that in $\mathbb{Z}_{p^e}$. Based on these facts and Lemma 5, the complexity of the unoptimized protocol on $C_p$ and the optimized protocol on $C_q$ can be summarized as follows (the two are equivalent when $d = 1$):
Here the optimization problem is to find $d$ such that the costs for $C_q$ are minimized. In particular, given $p$, the term $d^4 \sqrt[d]{p}$ is minimized to $((e \ln p)/4)^4$, which is much smaller than $p$, when $d = (\ln p)/4$, where $e$ here is Euler's number. In Section VII, we will present an experimental result where a cost reduction of two orders of magnitude was achieved by finding a proper $d$.
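The stated minimizer follows from a short calculus argument, sketched below under the simplifying assumption that $d$ is treated as a positive real variable (in practice $d$ is a positive integer, so the nearest feasible value would be used); here $e$ is Euler's number and $h$ is a name introduced only for this derivation.

```latex
\begin{align*}
h(d) &= d^{4}\, p^{1/d}, \qquad \ln h(d) = 4\ln d + \frac{\ln p}{d},\\
\frac{\mathrm{d}}{\mathrm{d}d}\,\ln h(d) &= \frac{4}{d} - \frac{\ln p}{d^{2}} = 0
  \;\Longrightarrow\; d^{\ast} = \frac{\ln p}{4},\\
h(d^{\ast}) &= \left(\frac{\ln p}{4}\right)^{4} p^{4/\ln p}
  = \left(\frac{\ln p}{4}\right)^{4} e^{4}
  = \left(\frac{e\,\ln p}{4}\right)^{4}.
\end{align*}
```

The derivative is negative for $d < d^{\ast}$ and positive for $d > d^{\ast}$, so $d^{\ast}$ is indeed the minimum. For instance, $p = 65537$ gives $d^{\ast} \approx 2.8$, which is consistent with the dimensions two and four explored in Section VII.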

VII. EXPERIMENTAL RESULTS
We present experimental results that quantify the efficiency of our scheme. Specifically, we conducted experiments that show how efficiently our scheme supports rounding, and how effective the optimization technique is. Also, to show the importance of rounding, we compare our scheme (with rounding) to the original GKR protocol (without rounding) on deeply nested matrix multiplications. We consider matrix multiplication since it is a well-studied benchmark considered by all of the existing GKR protocol variants, making it easier to compare with them. More importantly, matrix multiplication constitutes about 90% of DNN training workloads [61].

A. EXPERIMENTAL SETUP
We implemented our generalized GKR protocol xvii over a ring $R = \mathbb{Z}_{p^e}[t]/(f(t))$, where $f(t)$ is a monic irreducible polynomial over $\mathbb{Z}_{p^e}$. The modular operations in $\mathbb{Z}_{p^e}$ are implemented using Montgomery modular multiplication [62]. The code is written in C++11 using the GMP library, and compiled with the LLVM GCC compiler 9.1.0 (with -O3). The source code is available. xviii All the experiments were performed on a laptop with an Intel Core i5 CPU at 2.9GHz and 8GB of memory, running macOS (64-bit). Throughout this section, we report the verification cost excluding the cost of evaluating the MLEs of the input/output layers, since they are not involved in verifying rounding layers placed in the middle of a circuit.

B. EFFECTIVENESS OF OPTIMIZATION VIA GALOIS RING
To show the effectiveness of the optimization technique described in Section VI, we instantiated our scheme with different Galois rings and compared their performance. Specifically, given an original ring $R_1 = \mathbb{Z}_{65537^{7}}$, we took two Galois rings, $R_2 = \mathbb{Z}_{271^{14}}[t]/(x^2 + 1)$ and $R_3 = \mathbb{Z}_{17^{28}}[t]/(x^4 - 3)$, where $|R_1| \approx |R_2| \approx |R_3| \approx 2^{112}$. Then, we instantiated our optimized protocol (Section VI) with the three different rings, and experimented with them on a rounding layer that consists of $2^{14}$ rounding gates, where each rounding gate performs, roughly speaking, 16-bit rounding, i.e., truncating the least-significant 16 bits. xix Figure 2 shows the performance of the protocol over the different rings. The circuit evaluation cost drastically decreases as the dimension of the Galois ring increases. This is because the size of the rounding circuit for $R_3$ is much smaller than that for $R_1$, since the size depends on $ep$. However, this is not the case for the proof generation cost: the cost of individual ring operations increases quadratically with the dimension of the Galois ring, and thus offsets the benefit of a smaller rounding circuit when the dimension is too high. In our experimental setup, the protocol over $R_2$ of dimension two performed best in generating proofs.
xvii Specifically, the generalization was made on top of Thaler's variant [14], since we considered Thaler's variant to compare ours to the original GKR protocol as explained in Section VII-D.
xviii https://anonymous.4open.science/r/GKR_approx_code-35AF
xix More precisely, each rounding gate takes as input $x$, and outputs $\lfloor x/65537 \rfloor$, $\lfloor x/271^2 \rfloor$, and $\lfloor x/17^4 \rfloor$, respectively, for $R_1$, $R_2$, and $R_3$.
On the other hand, the verification cost increases as the dimension of a Galois ring increases, since the verification cost logarithmically depends on the rounding circuit size, thus the benefit of a smaller rounding circuit is insignificant, but the cost of individual ring operations dominates. In general, the optimal dimension varies depending on the set of parameters of the protocol and the characteristics of computation of interest. Also, we note that the circuit evaluation cost does not involve the cost overhead of individual operations of a Galois ring, since the circuit evaluation is performed over a base ring Z p e instead of its Galois ring Z p e [t]/(f (t)), as mentioned in Section VI. This is why the proof generation cost is bigger than the circuit evaluation cost when the dimension is greater than one, although our optimal circuit construction offers the proof generation cost that is asymptotically smaller than the circuit evaluation cost, as described in Section V-B.

C. EFFICIENCY OF VERIFIABLE ROUNDING OPERATION
To quantify the efficiency of our scheme for rounding, we applied our scheme to a single rounding layer that consists of multiple rounding gates. Specifically, we consider our generalized GKR protocol over $R_2 = \mathbb{Z}_{271^{14}}[t]/(x^2 + 1)$, and the rounding operation $x \mapsto \lfloor x/271^2 \rfloor$, which is roughly 16-bit rounding. Figure 3 shows the performance of our protocol for a rounding layer of various sizes, from $2^{8}$ to $2^{19}$. As described in Section V-B, the cost of circuit evaluation and proof generation is linear in the number of rounding gates, while the cost of verification and communication is logarithmic in the number of rounding gates. We also note that the verification becomes even faster than the native evaluation (i.e., performing the rounding operation directly on the native processor, without going through the arithmetic circuit) when the number of rounding gates is more than $2^{18}$.
Moreover, we tested the performance of our scheme for various numbers of rounding bits. Figure 4 shows the proof generation and circuit evaluation time for rounding, as well as the verification time, for rounding bit-sizes from 8 bits to 32 bits. We can see that the prover time depends linearly on the rounding bit-size while the verifier time remains almost constant. This coincides with the theoretical estimate, since rounding $8n$ bits is composed of $n$ repeated evaluations of the lowest-digit-removal polynomial, each of which performs 8-bit rounding (the verifier time remains almost constant since it depends logarithmically on the circuit size).

D. COMPARISON TO ORIGINAL GKR PROTOCOL ON ROUNDING ARITHMETIC
Now we compare our protocol (that supports rounding) to the original GKR protocol (that does not support rounding) on deeply nested matrix multiplications. The most important value of rounding is that it controls the number of digits within the limit of the underlying system, which is especially necessary for AI computations. This is the most fundamental advancement of our approach, compared to the original GKR. Moreover, in order to understand the end-to-end performance of our approach, we conducted a performance comparison with the original GKR as follows.
We considered Thaler's implementation [15] of the original GKR protocol, since it shows the best performance for matrix multiplication among the variants (e.g., [16], [21]). For a fair comparison, we modified Thaler's implementation to employ the same GMP library that we used in our protocol implementation. xx

1) DATASET AND COMPUTATION
For comparison, we consider a nested multiplication of depth $n$, $(\cdots((M^2)^2)^2 \cdots)^2 = M^{2^n}$, where $M$ is a $128 \times 128$ matrix whose elements are fixed-point numbers with 16 fractional bits (i.e., 16 bits below the decimal point), and no overflow occurs during the computation. xxi As an input, we randomly generated numbers in $[0, 1)$ with 16 fractional bits.
In the original GKR protocol (over a finite field $\mathbb{Z}_q$), which does not support rounding, the above nested multiplication over fixed-point numbers is represented as the integer-scaled nested multiplication, i.e., $(\cdots(((2^{16} M)^2)^2)^2 \cdots)^2 = (2^{16})^{2^n} M^{2^n}$. This means that the prime $q$ must be larger than $(2^{16})^{2^n}$; that is, the bit-size of field elements (in $\mathbb{Z}_q$) grows exponentially in the multiplication depth $n$. In contrast, in our protocol (over a ring $\mathbb{Z}_{p^e}$), the nested multiplication is represented as the integer-scaled nested multiplication with rounding, i.e., $\lfloor \cdots \lfloor \lfloor \lfloor (2^{16} M)^2 \rfloor^2 \rfloor^2 \rfloor \cdots \rfloor \approx 2^{16} M^{2^n}$, where $\lfloor \cdot \rfloor$ denotes $x \mapsto \lfloor x/2^{16} \rfloor$. Thus $p^e$ needs only to be larger than $2^{16} \cdot 2^{16n}$ (the additional factor $2^{16n}$ is due to the modulus change by rounding, as described in Remark 2). That is, the bit-size of ring elements (in $\mathbb{Z}_{p^e}$) is linear in the multiplication depth.
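The gap between the two modulus sizes is easy to tabulate. The sketch below simply evaluates the two lower bounds stated above, $(2^{16})^{2^n}$ without rounding and $2^{16} \cdot 2^{16n}$ with rounding, in bits; the function names are hypothetical and the numbers are bounds on the modulus size, not measured values.

```python
FRACTIONAL_BITS = 16

def log2_modulus_without_rounding(depth):
    """log2 of the bound (2^16)^(2^depth) that the prime q must exceed."""
    return FRACTIONAL_BITS * 2**depth

def log2_modulus_with_rounding(depth):
    """log2 of the bound 2^16 * 2^(16*depth) that p^e must exceed."""
    return FRACTIONAL_BITS * (1 + depth)

for depth in (6, 8, 10, 12):
    print(depth,
          log2_modulus_without_rounding(depth),
          log2_modulus_with_rounding(depth))
```

At depth 12 the bound without rounding is 65536 bits, against 208 bits with rounding, which is the exponential versus linear growth described in the text.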
xx While we experimented with matrix multiplication, we considered Thaler's general-purpose machinery instead of the special-purpose scheme for matrix multiplication, for the generality of experimental results.
xxi For simplicity, we consider $M$ such that the elements of $M$ and $M^{2^n}$ are positive fixed-point numbers less than 1, i.e., representable in 16 bits.
In Figure 5, we compare the performance of our protocol to that of Thaler's on nested matrix multiplication of different depths. To highlight the net effect of rounding, we report the cost for a single matrix multiplication in the context of different multiplication depths. That is, the cost for the entire nested multiplication is the one in Figure 5 multiplied by the number of matrix multiplications.
2) IMPROVEMENT
Figure 5 shows that the cost of Thaler's protocol increases exponentially in the multiplication depth, while ours is linear in the depth. When the multiplication depth is small (e.g., depth 6), the cost of our protocol can be larger than Thaler's, due to the overhead of rounding. However, when the multiplication depth exceeds a certain threshold (e.g., depth 8), ours is much better than Thaler's: roughly 10× and 100× faster proving (and verification) time when the depth is 10 and 12, respectively, and the difference increases exponentially with the depth. This experimental result confirms that it is critical to support the rounding operation for verifiable computing of an approximate arithmetic circuit with a large multiplication depth.

E. DISCUSSION
The soundness probability of our protocol in Figure 5 is set to $2^{-14}$, which is much worse than that of the Thaler's protocol we compared against ($2^{-1200}$ or less). However, optimizing ours to reach a similar soundness level could be overkill. In fact, our soundness can be quickly improved by repetition, e.g., running four prover-verifier pairs in parallel achieves $(2^{-14})^4 = 2^{-56} < 10^{-16}$ soundness, xxii which is similar to the soundness probability ($2^{-45}$ to $2^{-20}$ [15], [34]) of existing verifiable computing systems. xxiii
We can compare our result with the GKR protocol on a Boolean representation, i.e., representing all operations bitwise so that rounding is also efficiently representable. In fact, our method incurs a cost blow-up quadratic in the multiplicative depth and rounding bits for each rounding gate, while the Boolean representation incurs a blow-up quadratic in the input bits for each multiplication gate. Therefore, our method is efficient when the number of rounding gates is smaller than that of multiplication gates, i.e., when a lazy rounding strategy is applicable. Also, our method can be combined with (asymptotic) cost reduction techniques derived from highly regular wiring patterns, e.g., Thaler's protocol [14] for matrix multiplication or ours for polynomial evaluation. In contrast, it seems quite hard to apply them to the Boolean representation of such circuits.
xxii For comparison, $10^{-16}$ is the uncorrectable bit error rate of a typical hard disk [63].
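The repetition argument in this discussion can be checked with exact rational arithmetic. The sketch below assumes independent parallel runs that must all accept, so that the soundness errors multiply; the function name is hypothetical.

```python
from fractions import Fraction

def amplified_soundness(single_run_error, repetitions):
    """Soundness error when all independent repetitions must accept."""
    return single_run_error ** repetitions

error = amplified_soundness(Fraction(1, 2**14), 4)
print(error == Fraction(1, 2**56))
print(error < Fraction(1, 10**16))
```

Exact fractions avoid any floating-point doubt about the comparison with $10^{-16}$.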
We can also compare our result with recent works [12], [26] that combine a commitment scheme with an interactive proof. In these cases, the prover's cost for a rounding gate is dominated by that of the commitment, with as many messages as the bit-size of the input. Though this seems (asymptotically) more efficient than ours in the bit-size, the concrete cost of commitment shows that ours can be better or comparable in some cases. The experimental result of [12] on commitment implies that proving a rounding gate with their commitment would require at least 0.11 × (number of rounding bits) ms per gate, xxiv resulting in $2^{14} \times 16 \times 0.11$ ms $\approx 29$ sec for $2^{14}$ gates, which is about 7 times more costly than our prover (4 sec) in Figure 3. On the other hand, the commitment in [26] xxv is asymptotically more costly in verifier cost than ours.
We finally note that there is still room for improvement of ours. In particular, the individual operations of a Galois ring can be further improved, e.g., they can be broken down into multiple independent subroutines, being suitable for parallelization [64], [65]. It will drastically reduce the overhead of increasing the dimension of a Galois ring, which in turn will allow us to employ a much smaller prime p, further improving the overall performance of the protocol.

VIII. CONCLUSION
We presented a verifiable computing scheme that supports rounding, which is essential for approximate computations. Based on the (latest variant of the) GKR protocol, which is the most efficient in generating proofs among existing verifiable computing protocols, our scheme consists of the following elements: generalization of the GKR protocol over a ring, reduction of the rounding operation to a low-degree polynomial in a ring, optimal circuit construction of arbitrary polynomials, and optimization of proof generation for rounding via a Galois ring. We implemented our scheme and presented experimental results that show its efficiency for approximate computations. For example, ours performed two orders of magnitude better than the existing GKR protocol for a nested matrix multiplication of depth 12 on 16-bit fixed-point arithmetic.
xxiii Some argument systems [10], [39], [40]

A. LIMITATIONS AND FUTURE WORKS
Though we have significantly reduced the proving complexity of the rounding operation, it is not yet sufficient for practical use. For example, while proving $2^{12}$ rounding gates (with 16 fractional bits) takes roughly 1 sec, this is roughly $10^5\times$ slower than evaluating the rounding gates, and this overhead can be problematic when proving many rounding gates. Still, we remark that this overhead may be unavoidable for interactive proof systems in which rounding operations must be represented with arithmetic operations (and our representation is the simplest one possible).
On the other hand, if we allow argument systems utilizing cryptographic commitment schemes, rounding operations can be represented with bit representations, which results in (significant) overhead proportional to the bit length due to the cryptographic schemes. We remark that such overhead is inevitable in a finite field, which is the setting most commitment schemes support.
One interesting direction for future work is combining commitment schemes over rings (e.g., $\mathbb{Z}_{2^k}$ or $\mathbb{Z}_{p^e}$) with our solution. Very recently, some works [66], [67] have come up with such commitment schemes or ZK proof systems, and combining our result with one of them may result in a more efficient and practical solution for verifiable computation over approximate numbers (with rounding operations).
We finally remark that our solution (verifiable computation over rings) has been found to be applicable in a completely different context, e.g., verifiable computation of encrypted computations. More precisely, computations over ciphertexts encrypted by (fully) homomorphic encryption [68], [69] can be represented concisely with ring operations, and one can prove and verify the computation over ciphertexts with better flexibility (hence better efficiency by choosing more efficient parameters) utilizing our solution. We refer to [70] for the details.

APPENDIX A CIRCUIT CONSTRUCTION FOR VERIFIABLE POLYNOMIAL EVALUATION
A. NOTATION
Assume we are given a polynomial $g$ over a finite ring $\mathbb{Z}_{p^e}$. (Our representation is also valid with a polynomial over a finite field $\mathbb{F}$.) Let us fix $N = 2^{2n}$ to denote the smallest power of four such that $N \geq \deg(g)$. Let us index each layer where the input layer is indexed by 0. xxvi Let us also index each gate in a layer where the left-most gate is indexed by 0, and the index value is represented in binary form. We write $\tilde{V}_i$ to denote the MLE of the output values of the $i$-th layer as usual. For simplicity of presentation, we assume that the number of inputs, denoted by $M = 2^m$, is a power of two in the multi-input case. We write β s (x,
xxvi In the GKR protocol, the output layer is indexed by 0.

B. DESCRIPTION
Now we present the circuit representation for the polynomial $g(t) = \sum_{i=0}^{N} a_i t^i$. The circuit is composed of four parts, called polygen, eval, unify, and extract, respectively, as illustrated in Figure 1. We note that, as we will explain below, the eval and unify layers consist of two sub-circuits placed in parallel, where the left-hand side sub-circuit computes the sub-polynomials $g_i$ and $g_{i,j}$, while the right-hand side one computes the power terms $t^i$. Although the two sub-circuits compute different types of values, we design them to have an identical wiring pattern by introducing dummy gates (i.e., gates computing zero), so that the overall circuit becomes regular, allowing the verifier to be efficient. Here, the dummy gates affect only the width of the circuit, not the depth, and thus their effect on the verifier's cost is negligible, i.e., asymptotically zero, as the verifier's cost is logarithmically proportional to the circuit width. We first describe the single-input case (Figure 1).
Note that the above equation makes no distinction between the two sub-circuits, i.e., one that computes g 1,2 (t), . . . , g i−1,i (t) and another that computes 0, . . . , t 2j , which significantly reduces the prover's cost that otherwise would have been very large. This is achieved by introducing the dummy gates that compute zero, as explained earlier.
Finally, the extract layer, i.e., the $(2n+2)$-th layer, takes two inputs $(g(t) - a_0)$ and $t^N$, and simply returns $g(t)$ by adding the constant $a_0$ to the first input. The relation is as follows:
The multi-input case with $M = 2^m$ inputs follows naturally from the single-input case described so far (see Figure 6).
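As a rough illustration of how the four parts fit together, the following sketch evaluates $g(t)$ over $\mathbb{Z}_{p^e}$ in the same order: power generation, block-wise evaluation, pairwise unification, and a final addition of $a_0$. It is not the gate-level wiring of Figure 1. We assume here the block decomposition $g(t) - a_0 = \sum_{k=0}^{2^n - 1} g_k(t)\, t^{k \cdot 2^n}$ with $g_k(t) = \sum_{j=1}^{2^n} a_{k 2^n + j}\, t^j$, which is one reading consistent with the description above; the function name and argument layout are ours.

```python
def eval_poly_layered(coeffs, t, n, modulus):
    """Evaluate g(t) = sum(coeffs[i] * t**i) mod modulus, with deg(g) = N = 4**n.

    Mirrors the polygen / eval / unify / extract order (illustrative only).
    """
    block = 2**n
    assert len(coeffs) == block * block + 1, "expects exactly N + 1 coefficients"

    # polygen: powers t^1 .. t^(2^n); each pass doubles the number of known powers.
    powers = [t % modulus]                      # powers[j - 1] == t^j
    while len(powers) < block:
        top = powers[-1]
        powers += [(top * p) % modulus for p in powers]

    # eval: one sub-polynomial per block of 2^n consecutive coefficients.
    values = [
        sum(coeffs[k * block + j] * powers[j - 1] for j in range(1, block + 1)) % modulus
        for k in range(block)
    ]
    step = powers[-1]                           # t^(2^n)

    # unify: merge neighbouring values pairwise; the power term is squared per layer.
    while len(values) > 1:
        values = [
            (values[i] + values[i + 1] * step) % modulus
            for i in range(0, len(values), 2)
        ]
        step = (step * step) % modulus
    # Here values[0] == g(t) - a_0 and step == t^N.

    # extract: add the constant term back.
    return (values[0] + coeffs[0]) % modulus


# Example over Z_{3^4} with n = 2, i.e., a polynomial of degree N = 16.
modulus = 3**4
coeffs = [(7 * i + 3) % modulus for i in range(17)]
direct = sum(c * pow(5, i, modulus) for i, c in enumerate(coeffs)) % modulus
print(eval_poly_layered(coeffs, 5, 2, modulus) == direct)
```

In this sketch the unify loop runs $n$ times, so the power term ends at $t^{2^n \cdot 2^n} = t^N$, matching the second input of the extract layer.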

A. PROOF OF THEOREM 2
Proof: [Sketch] Completeness directly follows from the protocol description. The main idea for showing soundness can be summarized as follows (see [4] or [58] for the full proof). Assume that a (dishonest) P sends an incorrect result $S'(f) \neq S(f)$ to V. Let us distinguish the values claimed by P from the values which would be claimed by an honest P by adding the prime ($'$) symbol. Then $f'_1(t) \neq f_1(t)$. Otherwise, V will reject immediately by checking if $S'(f) = f'_1(0) + f'_1(1)$. When V chooses a random $r_1$ from $\mathbb{F}$, by the Schwartz-Zippel lemma (Lemma 1), $f'_1(r_1) \neq f_1(r_1)$ with high probability $(1 - \frac{d}{|\mathbb{F}|})$ since $f_1(t)$ is a polynomial of degree at most $d$. If $f'_1(r_1) \neq f_1(r_1)$, P must send $f'_2(t) \neq f_2(t)$ by the same reasoning as before. Continuing this, P must send $f'_\nu(t) \neq f_\nu(t) = f(r_1, \ldots, r_{\nu-1}, t)$, and will be rejected with high probability by V, who finally checks if $f'_\nu(r_\nu) = f(r_1, \ldots, r_\nu)$ for a randomly chosen $r_\nu$ in $\mathbb{F}$. The soundness probability bound is derived from the probability $1 - (1 - \frac{d}{|\mathbb{F}|})^\nu \leq \frac{\nu d}{|\mathbb{F}|}$ that at least one of the above high-probability events does not occur during the protocol.
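The argument above can be followed on a toy instance. The sketch below runs the sum-check interaction for a multilinear $f$ (so $d = 1$) given by its table of values on $\{0,1\}^\nu$, with an honest prover and with a prover that adjusts only its first message to fit a false claim. The prime modulus $2^{61} - 1$, the function names, and the choice to send each $f_i$ as the pair $(f_i(0), f_i(1))$ are assumptions made for illustration; they are not taken from our implementation.

```python
import random

P = 2**61 - 1  # prime field size, chosen only for this sketch


def fold(table, r):
    """Fix the first variable of a multilinear table to the field element r."""
    half = len(table) // 2
    return [((1 - r) * table[i] + r * table[half + i]) % P for i in range(half)]


def mle_eval(table, point):
    """Evaluate the multilinear extension of the table at the given point."""
    for r in point:
        table = fold(table, r)
    return table[0]


def sumcheck(table, claimed_sum, rng, cheat=False):
    """Run the interaction; return True iff the verifier accepts."""
    claim = claimed_sum % P
    current = list(table)
    point = []
    while len(current) > 1:
        half = len(current) // 2
        f0 = sum(current[:half]) % P     # f_i(0), computed honestly
        f1 = sum(current[half:]) % P     # f_i(1), computed honestly
        if cheat and not point:
            f0 = (claim - f1) % P        # dishonest first message fitted to the claim
        if (f0 + f1) % P != claim:       # verifier's round consistency check
            return False
        r = rng.randrange(P)             # verifier's random challenge
        point.append(r)
        claim = ((1 - r) * f0 + r * f1) % P
        current = fold(current, r)
    # Final check against a direct evaluation of f at the random point.
    return claim == mle_eval(table, point)


table = [3, 1, 4, 1, 5, 9, 2, 6]         # f on {0,1}^3
true_sum = sum(table) % P
rng = random.Random(0)
print(sumcheck(table, true_sum, rng))                  # honest prover, true claim
print(sumcheck(table, true_sum + 1, rng))              # honest messages, false claim
print(sumcheck(table, true_sum + 1, rng, cheat=True))  # fitted first message
```

In the last call the first consistency check passes, but the fitted polynomial differs from the honest one, so the claim carried into the second round disagrees with the honest sum unless the challenge happens to hit the single point where the two lines meet.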

B. PROOF OF LEMMA 2
Proof: Uniqueness follows from the observation that any multilinear polynomial $\tilde{V}(x_1, x_2, \ldots, x_\mu)$ can be represented by $\sum_{b \in \{0,1\}^\mu} C(b)\, x^b$, where $x^b := \prod_{i \in I} x_i$ with $I := \{i \mid b_i = 1\}$, and $C(b) \in \mathbb{F}$ is the coefficient corresponding to each monomial $x^b$. Then, $C(b)$ is uniquely determined by the values $\tilde{V}(b)$ for $b \in \{0,1\}^\mu$. Specifically, for the zero vector $\mathbf{0}$, $C(\mathbf{0}) = \tilde{V}(\mathbf{0})$. For an elementary vector $e_i$ whose $i$-th component is 1 and all others are 0, $C(e_i) = \tilde{V}(e_i) - C(\mathbf{0})$. For a vector $e_{i,j} \in \{0,1\}^\mu$ ($i \neq j$) whose $i$-th and $j$-th components are 1 and all others are 0, $C(e_{i,j}) = \tilde{V}(e_{i,j}) - C(e_i) - C(e_j) - C(\mathbf{0})$. Continuing this process with increasing weight of the vector $b \in \{0,1\}^\mu$, we can see that every $C(b)$ is uniquely determined.

C. PROOF OF LEMMA 3
Proof: It follows by induction on the number of variables $n$, as in the original Schwartz-Zippel lemma (Lemma 1), provided that it holds in the single-variable case. Let $a_1 \in A$ be a root of $f(t)$. By the division algorithm with the monic polynomial $(t - a_1)$, $f(t) = (t - a_1) f_1(t)$ and the degree of $f_1(t)$ is less than that of $f(t)$. Note that another root $a_2 \in A$ ($a_2 \neq a_1$), if it exists, must be a root of $f_1(t)$ since $(a_2 - a_1)$ is not a zero divisor and $f(a_2) = 0$. Then, the division algorithm with the monic polynomial $(t - a_2)$ on $f_1(t)$ gives $f(t) = (t - a_1)(t - a_2) f_2(t)$ and the degree of $f_2(t)$ is less than that of $f_1(t)$. Continuing this process, we conclude that $f(t)$ cannot have more roots in $A$ than the degree of $f(t)$.
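The role of the sampling set $A$ in the single-variable argument can be seen on a small example. Over the whole ring $\mathbb{Z}_9$, the polynomial $t^2$ has more roots than its degree, whereas inside a set whose pairwise differences are not zero divisors the bound holds. In the sketch below we take $A = \{0, 1, \ldots, p-1\}$, which is an assumption made for illustration (its pairwise differences are units modulo $p^e$); the helper `roots` is hypothetical.

```python
def roots(coeffs, points, modulus):
    """Points at which the polynomial vanishes modulo `modulus`.

    coeffs[i] is the coefficient of t^i; evaluation uses Horner's rule.
    """
    def value(t):
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * t + c) % modulus
        return acc

    return [t for t in points if value(t) == 0]


p, e = 3, 2
modulus = p**e                 # the ring Z_9
f = [0, 0, 1]                  # f(t) = t^2, degree 2

all_roots = roots(f, range(modulus), modulus)   # roots over the whole ring
roots_in_A = roots(f, range(p), modulus)        # roots inside A = {0, 1, 2}

print(all_roots)    # [0, 3, 6]: three roots although deg(f) = 2
print(roots_in_A)   # [0]: within A the degree bound is respected
```

The extra roots 3 and 6 differ from 0 by multiples of $p$, i.e., by zero divisors, which is exactly the case excluded in the proof.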

D. PROOF OF THEOREM 3
Proof: The proof is almost the same as that of the original Sum-Check protocol. The generalized Schwartz-Zippel lemma (Lemma 3) implies that any two distinct univariate polynomials of degree $\leq d$ over $R$ agree on at most $d$ points of $A$. Following the proof of the original Sum-Check protocol (Theorem 2), the soundness probability of the generalized Sum-Check protocol is bounded by $\frac{nd}{|A|}$.

E. PROOF OF LEMMA 5 (PROVER COST)
Proof: We follow the notation of Appendix A. The circuit representation is composed of the parts polygen, eval, unify, extract, and division as described before, and the depth is $2n + 3 = O(\log N)$. We first estimate the cost of P for evaluating the circuit. It is simply $M$ times the cost of evaluating the circuit of a single polynomial evaluation, so we only estimate the single case (Figure 1). The $i$-th layer in polygen requires $2^{i-1}$ multiplications, resulting in $O(2^n)$ total for the polygen part. The eval layer requires $O(2^n \cdot 2^{n+1}) = O(2^{2n})$ operations, since evaluating each $f_i(t)$ given $\{t^j\}_{j=1}^{2^n}$ requires $O(2^n)$ operations. The $j$-th layer in unify requires $2^{2n+3-j}$ operations, resulting in $O(2^n)$ total for the unify part.
Since the extract and division parts are of negligible cost, the total cost for evaluation is $O(2^n + 2^{2n} + 2^n) = O(N)$, resulting in $O(NM)$ for $M$ rounding gates. Now we estimate the cost of P for proving the evaluation given all outputs of the gates in the circuit. We assume Thaler's "reusing work" technique [15], which reduces P's cost for evaluating all $\beta_m(w, p)$, $\tilde{V}(q)$, and $\alpha(r)$ values required for sum-check to only $O(2^m)$, $O(2^s)$, and $O(2^t)$, respectively, where $m$, $s$, and $t$ are the numbers of variables constituting $p$, $q$, and $r$, respectively. xxvii Thus, to estimate the cost, it suffices to count the number of variables appearing in the summands of the relation of MLEs in the multi-rounding case (Figure 6).
In the polygen part, reducing from $\tilde{V}_{i+1}$ to $\tilde{V}_i$ requires $O(2^m + 2^{m+i})$ cost for sum-check, and an additional $O(i \cdot 2^i)$ cost for reducing to a single point, resulting in a total cost of $O(2^{m+n} + n \cdot 2^n)$. In the eval layer, sum-check requires $O(2^{2n+1} + 2^{m+n})$ cost. In the unify part, reducing from $\tilde{V}_{j+1}$ to $\tilde{V}_j$ requires $O(2^m + 2^{m+2n+2-j})$ cost for sum-check, and an additional $O((2n + 2 - j) \cdot 2^{2n+2-j})$ cost for reducing to a single point, xxviii resulting in a total cost of $O(2^{m+n} + n \cdot 2^n)$. The extract and division layers do not affect P's cost since they do not require sum-check.
Overall, the cost of proving is $O(2^{m+n} + 2^{2n} + 2^{m+n} + n \cdot 2^n)$ which is O(

F. PROOF OF LEMMA 5 (VERIFIER COST)
Proof: We follow the notation of Appendix A. Note that $\alpha(z, q)$ can be precomputed at cost $O(N)$ using memoization [6], and will not be considered in the following estimation. Each $\beta_m$ can be evaluated at cost $O(m)$ due to its simple form [15, Section 4.3.1], without affecting the asymptotic cost of V. Also, as in the original GKR protocol, V's cost for the initial and final steps is $O(M \log M)$, since there are $O(M)$ inputs and outputs. Now, we can estimate the cost of V based on that in the sum-check (Theorem 2). Recall that in sum-check, the cost of V depends on the number of variables managed by the summation. In the polygen layers, reducing from $\tilde{V}_{i+1}$ to $\tilde{V}_i$ requires V to perform $O(m)$ operations for sum-check, and $O(i)$ for reducing to a single point. Therefore, the cost for the polygen layers is $O(mn + n^2)$. In the eval layer, $O(n)$ cost is required. In the unify layers, reducing from $\tilde{V}_{j+1}$ to $\tilde{V}_j$ requires V to perform $O(m)$ operations for sum-check and $O(2n + 2 - j)$ for reducing to a single point, resulting in $O(mn + n^2)$ cost in total. Since the cost for the extract and division layers is negligible, the total cost of V without the initial and final steps is $O(mn + n^2) = O(\log N \log MN)$. The bound on the soundness probability and the communication cost can be estimated similarly.

G. PROOF OF LEMMA 6
We use the following lemma, whose proof can be found in [71].
Lemma 7 [71, Lemma 5.9]: The $n$-th cyclotomic polynomial $\Phi_n$ of degree $\varphi(n)$ is irreducible over $\mathbb{Z}_p$ if and only if $p$ is a primitive root modulo $n$, i.e., $p$ does not divide $n$ and its multiplicative order modulo $n$ is $\varphi(n)$, where $\varphi$ is Euler's totient function.
Proof of Lemma 6: Items (i), (ii), and (iii) follow directly from the above lemma and the fact that each prime $p$ is a primitive root modulo 4, 5, or 9, respectively.
A more algebraic proof can be found in [72]. For (iv), note that if $x^3 - a$ is reducible, it has a monic linear factor, and hence $x^3 - a$ has a solution in $\mathbb{Z}_p$. We show that there exists an $a$ such that $x^3 - a$ has no solution in $\mathbb{Z}_p$, which is equivalent to the claim that the function $t \mapsto t^3 : \mathbb{Z}_p \to \mathbb{Z}_p$ is not injective. Note that the multiplicative group $\mathbb{Z}_p^\times$ of $\mathbb{Z}_p$ has order $p - 1$, and this order is a multiple of 3 when $p \equiv 1 \bmod 3$. Now, by the Sylow theorem, there exists a subgroup of order 3 in $\mathbb{Z}_p^\times$, and hence there exist at least 3 elements in $\mathbb{Z}_p$ whose cube is 1. Therefore, the claim follows.
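The counting argument for (iv) can be observed directly for small primes. The following sketch lists the cubes in $\mathbb{Z}_p$ and the elements $a$ for which $x^3 - a$ has no root; the function names are ours, and the brute-force search is only meant for tiny $p$.

```python
def cubes_mod(p):
    """Image of the map t -> t^3 on Z_p."""
    return {pow(t, 3, p) for t in range(p)}


def non_cubes_mod(p):
    """Elements a of Z_p such that x^3 - a has no root in Z_p."""
    image = cubes_mod(p)
    return [a for a in range(p) if a not in image]


# p = 7 satisfies p = 1 mod 3: the cube map is not injective, so non-cubes exist.
print(sorted(cubes_mod(7)))   # [0, 1, 6]
print(non_cubes_mod(7))       # [2, 3, 4, 5]

# p = 5 satisfies p = 2 mod 3: the cube map is a bijection, so every a is a cube.
print(non_cubes_mod(5))       # []
```

For $p = 7$, the three elements 1, 2, and 4 all cube to 1, which is the subgroup of order 3 used in the proof.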
Note that Lemma 7 above implies that we can find many irreducible cyclotomic polynomials (with few non-zero coefficients) of higher degree, if needed.
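The criterion of Lemma 7 is easy to test numerically, which may help when searching for further candidates. The sketch below compares the multiplicative order of $p$ modulo $n$ with $\varphi(n)$; the function names are hypothetical, and the brute-force totient is adequate only for small $n$.

```python
from math import gcd


def euler_phi(n):
    """Euler's totient by direct counting (fine for small n)."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def multiplicative_order(p, n):
    """Order of p in the unit group modulo n, or None if p is not a unit."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if gcd(p, n) != 1:
        return None
    order, power = 1, p % n
    while power != 1:
        power = (power * p) % n
        order += 1
    return order


def cyclotomic_is_irreducible_mod(p, n):
    """Criterion of Lemma 7: the order of p modulo n equals phi(n)."""
    return multiplicative_order(p, n) == euler_phi(n)


print(cyclotomic_is_irreducible_mod(3, 4))   # True:  x^2 + 1 is irreducible mod 3
print(cyclotomic_is_irreducible_mod(5, 4))   # False: x^2 + 1 = (x - 2)(x + 2) mod 5
print(cyclotomic_is_irreducible_mod(2, 5))   # True:  the order of 2 mod 5 is 4
print(cyclotomic_is_irreducible_mod(2, 9))   # True:  the order of 2 mod 9 is 6
```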

ACKNOWLEDGMENT
This work is a revised version of the thesis ''Verifiable Computing for Approximate Arithmetic'' [1] with additional experiments, source code distribution, and further discussions on limitations and future works. This work was done when Dongwoo Kim was with the Department of Mathematical Sciences, Seoul National University.
Since 2020, he has been a Principal Engineer in security and cryptography at Western Digital Research, Milpitas, CA, USA. Before that, he was a Researcher at the Industrial and Mathematical Data Analytics Research Center, Seoul National University. His research interests include the improvement of homomorphic encryption, verifiable computation, and other cryptographic primitives for practical applications.
DAEJUN PARK received the B.S. and M.S. degrees from Seoul National University and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign.
He is currently a Senior Blockchain Security Engineer at a16z crypto, developing formal methods and tools for web3 security to help portfolio companies in particular and the web3 community in general to raise their security bar. Prior to joining a16z crypto, he was the Director of Formal Verification at Runtime Verification, a research-based startup offering formal verification tools and services, where he led a team of formal verification engineers and security auditors for smart contracts and consensus protocols. Early in his career, he was a Founding Member of another tech startup Sparrow, where he designed and implemented static program analysis tools that detect memory safety errors for embedded systems software.
Dr. Park's work on Language-Parametric Program Verification received the Distinguished Paper Award at OOPSLA'16. VOLUME 10, 2022