DeepSplit: Scalable Verification of Deep Neural Networks via Operator Splitting

Analyzing the worst-case performance of deep neural networks against input perturbations amounts to solving a large-scale non-convex optimization problem, for which several past works have proposed convex relaxations as a promising alternative. However, even for reasonably-sized neural networks, these relaxations are not tractable, and so must be replaced by even weaker relaxations in practice. In this work, we propose a novel operator splitting method that can directly solve a convex relaxation of the problem to high accuracy by splitting it into smaller sub-problems that often have analytical solutions. The method is modular, scales to very large problem instances, and comprises operations that are amenable to fast parallelization with GPU acceleration. We demonstrate our method in bounding the worst-case performance of large convolutional networks in image classification and reinforcement learning settings, and in reachability analysis of neural network dynamical systems.

networks [19]. One line of work studies computationally cheaper but looser bounds of the LP relaxation [7], which we refer to as linear bounds and which have been extended to larger and more general networks and settings [20,11,12,21]. These bounds tend to be loose unless optimized during training, which typically comes at a significant cost to standard performance. Further work has aimed to tighten these bounds [22,23,24]; however, these works focus primarily on small convolutional networks and struggle to scale to more typical deep networks. Other work has studied the limits of these convex relaxations on such small networks using vast amounts of CPU compute [19]. Recent SDP-based approaches [25] can produce much tighter bounds on these small networks.
Lagrangian-based bounds Related to our work is the line of work that solves the Lagrangian of the LP relaxation [13,14], which can tighten the bound but does not aim to solve the LP exactly due to relatively slow convergence. Moreover, these works primarily study small networks whose LP relaxation can still be solved exactly with Gurobi. Although they could in principle be applied to larger networks, only the faster, linear bounds-based methods [21] have demonstrated applicability to standard deep learning architectures. In our work, we solve the LP relaxation exactly for large networks that previously have only been studied with loose bounds of the LP relaxation such as LiRPA [21].
Operator splitting methods Operator splitting, and in particular the ADMM method, is a powerful technique in solving structured convex optimization problems and has applications in numerous settings ranging from optimal control [26] to training neural networks [27]. These methods scale well with the problem dimensions, can exploit sparsity in the problem data efficiently [28], are amenable to parallelization on GPU [29], and have well-understood convergence properties under minimal regularity assumptions [15]. The benefit of ADMM as an alternative to interior-point solvers has been shown in various classes of optimization problems [30]. Our operator splitting method is specifically tailored for neural network verification in order to fully exploit the problem structure.
Complete verification methods These methods verify properties of deep networks exactly (i.e., they find the optimal value J* and a global solution x*) using methods such as Satisfiability Modulo Theories (SMT) solvers [31,18,32] and MILP solvers [3,4,5,6]. Complete verification methods typically rely on branch-and-bound (BaB) algorithms [33], in which the verification problem is divided into subproblems (branching) that can be verified using incomplete verification methods (bounding) [34]. However, these methods have a worst-case exponential runtime and have difficulty scaling beyond relatively small convolutional networks. Motivated by this, several recent works have improved the scalability of BaB methods by developing custom solvers for the bounding part that compute cheaper intermediate bounds and hence speed up practical running time [14,35,36,17]. In the same spirit, our proposed method for solving large-scale LP relaxations can potentially be used as a bounding subroutine in BaB for complete verification, which we leave for future research.
Notation We denote the set of real numbers by R, the set of nonnegative real numbers by R_+, the set of real n-dimensional vectors by R^n, the set of m × n real-valued matrices by R^{m×n}, and the n-dimensional identity matrix by I_n. The p-norm (p ≥ 1) is denoted by ‖·‖_p : R^n → R_+. For a set S, we define the indicator function I_S(x) of S as I_S(x) = 0 if x ∈ S and I_S(x) = +∞ otherwise. Given a function f : X → Y, the graph of f is the set G_f = {(x, f(x)) | x ∈ X}.

Neural network verification via operator splitting
We consider an ℓ-layer feed-forward neural network f : R^{n_0} → R^{n_ℓ} described by the following recursive equations, where x_0 ∈ R^{n_0} is the input to the neural network, x_k ∈ R^{n_k} is the input to the k-th layer, and φ_k : R^{n_k} → R^{n_{k+1}} is the operator of the k-th layer, which can represent any commonly used operator in deep networks, such as linear (convolutional) layers¹, MaxPooling units, and activation functions.
Given the neural network f, a specification function J : R^{n_ℓ} → R, and an input set X ⊂ R^{n_0}, we say that f satisfies the specification J if J(f(x)) ≥ 0 for all x ∈ X. This is equivalent to verifying that the optimal value of (1) is non-negative. We assume X ⊂ R^{n_0} is a closed convex set and J : R^{n_ℓ} → R ∪ {+∞} is a convex function.
Using the sequential representation of the neural network in (3), we may rewrite the optimization problem in (1) as the following constrained optimization problem with n := Σ_{k=0}^{ℓ} n_k decision variables x_0, ..., x_ℓ. We can rewrite (4) equivalently as (5), where G_{φ_k} is the graph of φ_k. Here x̲_k and x̄_k are a priori known bounds on x_k when x_0 ∈ X². The problem in (5) is non-convex due to the presence of nonlinear operators in the network, such as activation layers, which render the set G_{φ_k} non-convex. By over-approximating G_{φ_k} by a convex set (or ideally by its convex hull), we arrive at a direct layer-wise convex relaxation of the problem, in which every two consecutive variables (x_k, x_{k+1}) are sequentially coupled. Solving this relaxation directly cannot exploit this structure and is hence unable to scale to even medium-sized neural networks [19]. In this section, by exploiting the sequential structure of the constraints and introducing auxiliary decision variables (variable splitting), we propose a reformulation of (4) whose convex relaxation can be decomposed into smaller sub-problems that can be solved efficiently and in a scalable manner.

¹ Convolution is a linear operator and conceptually it can be analyzed in the same way as linear layers.
² See Section 2.4.2 for comments on finding the bounds x̲_k and x̄_k.

Figure 1: Illustration of the network structure (top) and the DeepSplit computation module for a generic layer (bottom). Adding identity layers between the neural network layers decouples the variables x_k and allows processing them independently.

Variable splitting
By introducing the intermediate variables y_k and z_k, we can rewrite (4) in a form that now has 3n − n_0 − n_ℓ decision variables. Intuitively, we have introduced additional "identity layers" between consecutive layers (see Figure 1). By over-approximating G_{φ_k} by a convex set S_{φ_k}, we obtain the convex relaxation subject to y_k = x_k, for which J_relaxed ≤ J*. This form is known as consensus form, as y_k and z_{k−1} are just copies of the variable x_k. As shown below, this "overparameterization" allows us to split the optimization problem into smaller sub-problems that can be solved in parallel and often in closed form.
For the Lagrangian in (8), the dual function, which provides a lower bound on J_relaxed, is given by g(λ, µ) = inf_{(x,y,z)} L(x, y, z, λ, µ). The best lower bound can then be found by maximizing the dual function. However, solving the inner problem jointly over (x, y, z) to find the dual function is as difficult as solving a direct convex relaxation of (4). Instead, we split the primal variables (x, y, z) into x and (y, z) and apply the classical ADMM algorithm to obtain the following iterations (shown in Figure 1) for updating the primal and dual variables. As we show below, the Lagrangian has a separable structure by construction that can be exploited to efficiently implement each step of (9).
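The three-step structure of (9) — a proximal x-update, a projection (y, z)-update, and a running-sum dual update — can be illustrated on a toy consensus problem. The sketch below is not the DeepSplit solver itself: it applies scaled ADMM to minimize ½‖x − a‖² subject to x = z with z constrained to a box, and all function and variable names are illustrative.

```python
import numpy as np

def admm_toy(a, lo, hi, rho=1.0, iters=200):
    """Scaled ADMM on: minimize 0.5*||x - a||^2 s.t. x = z, z in [lo, hi].
    Mirrors the three steps of (9): prox step, projection, dual update."""
    x = np.zeros_like(a)
    z = np.zeros_like(a)
    mu = np.zeros_like(a)  # scaled dual variable
    for _ in range(iters):
        # x-update: argmin 0.5*||x - a||^2 + (rho/2)*||x - z + mu||^2 (closed form)
        x = (a + rho * (z - mu)) / (1.0 + rho)
        # (y,z)-update: projection onto the constraint set (here, a box)
        z = np.clip(x + mu, lo, hi)
        # scaled dual update: accumulate the consensus residual
        mu = mu + x - z
    return x, z
```

At convergence x and z agree and equal the projection of a onto the box, the solution of this toy problem.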

The x-update
The Lagrangian in (8) is separable across the x_k variables; hence, the minimization in (9a) can be done independently for each x_k. Specifically, for k = 0, we obtain an update rule for x_0 that amounts to a projection onto X. Projections onto the ℓ_∞ and ℓ_2 balls can be done in closed form; for the ℓ_1 ball, we can use the efficient projection scheme proposed in [37], which has O(n_0) complexity in expectation. For the subsequent layers, we obtain analogous proximal updates x_k⁺ = argmin(·). For convex J and ρ > 0, the optimization problem for updating x_ℓ is strongly convex with a unique optimal solution; indeed, its solution is the proximal operator of J/ρ evaluated at z_{ℓ−1} − µ_{ℓ−1}. For the special case of linear objectives, J(x_ℓ) = cᵀx_ℓ, we obtain the closed-form solution x_ℓ⁺ = z_{ℓ−1} − µ_{ℓ−1} − c/ρ.
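As a concrete sketch of the two closed-form x-updates above — the ℓ_∞-ball projection for x_0 and the linear-objective proximal step for x_ℓ — the snippet below assumes the scaled-dual convention of (9); the function names are ours, not part of the paper's code.

```python
import numpy as np

def project_linf_ball(v, center, eps):
    """Closed-form projection of v onto {x : ||x - center||_inf <= eps}."""
    return np.clip(v, center - eps, center + eps)

def x_last_update(z_prev, mu_prev, c, rho):
    """x_l-update for a linear objective J(x_l) = c^T x_l: minimizing
    c^T x + (rho/2)*||x - (z_prev - mu_prev)||^2 over x gives
    x = z_prev - mu_prev - c/rho, i.e. the proximal operator of J/rho
    evaluated at z_prev - mu_prev."""
    return z_prev - mu_prev - c / rho
```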

The (y, z)-update
Similarly, the Lagrangian is also separable across the (y_k, z_k) variables. Updating these variables in (9b) corresponds to the following projection operations per layer, for k = 0, ..., ℓ − 1. Depending on the type of the layer (linear, activation, convolution, etc.), we obtain different projections, which we describe below.

Affine layers
Suppose φ_k(y_k) = W_k y_k + b_k is an affine layer representing a fully-connected, convolutional, or average pooling layer. Then the graph of φ_k is already a convex set, and the projection in (11) takes the form given in (12). The matrix (I_{n_k} + W_kᵀW_k)^{−1} can be pre-computed and cached for subsequent iterations. For convolutional layers, we can do this efficiently using the fast Fourier transform (FFT), which we discuss in Appendices A.2 and A.3.
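A minimal sketch of this affine-graph projection: the Euclidean projection of a point (u, v) onto {(y, z) : z = W y + b} solves min ‖y − u‖² + ‖z − v‖² subject to z = W y + b, whose normal equations give the form below. For clarity the inverse is recomputed here; in the algorithm it would be cached (and, for convolutions, applied via FFT). Names are illustrative.

```python
import numpy as np

def project_affine_graph(u, v, W, b):
    """Project (u, v) onto the graph {(y, z) : z = W y + b}.
    Normal equations: (I + W^T W) y = u + W^T (v - b)."""
    n = W.shape[1]
    M = np.linalg.inv(np.eye(n) + W.T @ W)  # cache this in practice
    y = M @ (u + W.T @ (v - b))
    z = W @ y + b  # projected point lies on the graph by construction
    return y, z
```

If (u, v) is already on the graph, the projection returns it unchanged.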

Activation layers
For an activation layer of the form φ(x) := [ϕ_1(x_1) ··· ϕ_n(x_n)]ᵀ, the convex relaxation of G_φ is given by the Cartesian product of the individual convex relaxations, i.e., S_φ = S_{ϕ_1} × ··· × S_{ϕ_n}. For a generic activation function ϕ : R → R, suppose we have a concave upper bound ϕ̄ and a convex lower bound ϕ̲ on ϕ over an interval [x̲, x̄]; the resulting relaxation turns out to be the convex hull of G_ϕ when ϕ̄ and ϕ̲ are the concave and convex envelopes of ϕ, respectively. The assumed pre-activation bounds x̲ and x̄ used to relax the activation functions can be obtained a priori via a variety of existing techniques, such as linear bounds [7,12,38], which propagate linear lower and upper bounds on each activation function (see Figure 2) throughout the network in a fast manner.
When either x̲ ≥ 0 or x̄ ≤ 0, the graph G_ϕ of the ReLU function becomes a line segment, which allows a closed-form projection of a point onto it.
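The closed-form projection onto such a segment is elementary; the sketch below projects a 2-D point (pre-activation, post-activation) onto the segment between two endpoints, which covers both degenerate ReLU cases (x̲ ≥ 0: the segment from (x̲, x̲) to (x̄, x̄); x̄ ≤ 0: the segment from (x̲, 0) to (x̄, 0)). The function name is ours.

```python
import numpy as np

def project_segment(point, p0, p1):
    """Project a 2-D point onto the segment [p0, p1]: clamp the coefficient
    of the orthogonal projection onto the supporting line to [0, 1]."""
    d = p1 - p0
    t = np.clip(np.dot(point - p0, d) / np.dot(d, d), 0.0, 1.0)
    return p0 + t * d
```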

The (λ, µ)-update
Finally, we update the scaled dual variables according to (9c). The DeepSplit algorithm is summarized in Algorithm 1.

Convergence and stopping criterion
The DeepSplit algorithm converges to the optimal solution of the convex problem (7) under mild conditions. Specifically, when J is closed, proper and convex, and when the sets S k (convex outer approximations of the graph of the layers) along with X are closed and convex, we can resort to the convergence results of ADMM [15]. Convergence of our algorithm is formally analyzed in Appendix A.1.
Following [15], for the LP relaxation (7) of a feed-forward neural network, the primal and dual residuals r_p and r_d are defined as the residuals of the optimality conditions for (7), and they converge to zero as the algorithm proceeds. A reasonable termination criterion is that the primal and dual residuals both be small, i.e., ‖r_p‖ ≤ ε_p and ‖r_d‖ ≤ ε_d, where ε_p > 0 and ε_d > 0 are tolerance levels [15, Chapter 3]. These tolerances can be chosen using an absolute and relative criterion, with p = n_0 + 2Σ_{i=1}^{ℓ−1} n_i + n_ℓ and n = Σ_{i=0}^{ℓ} n_i, where ε_abs > 0 and ε_rel > 0 are absolute and relative tolerances. Here n is the dimension of x, the vector of primal variables updated in the first step of the algorithm, and p is the total number of consensus constraints.
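A sketch of the absolute/relative tolerance choice, assuming the standard form from [15, Chapter 3]: ε_p = √p·ε_abs + ε_rel·(scale of the primal iterates) and ε_d = √n·ε_abs + ε_rel·(scale of the dual iterates). The scale arguments and function name are illustrative.

```python
import numpy as np

def tolerances(p, n, eps_abs, eps_rel, primal_scale, dual_scale):
    """Stopping tolerances in the style of [15, Ch. 3]: p is the number of
    consensus constraints, n the dimension of x; primal_scale/dual_scale
    are norms of the current primal/dual iterates."""
    eps_p = np.sqrt(p) * eps_abs + eps_rel * primal_scale
    eps_d = np.sqrt(n) * eps_abs + eps_rel * dual_scale
    return eps_p, eps_d
```

The iteration stops once ‖r_p‖ ≤ ε_p and ‖r_d‖ ≤ ε_d hold simultaneously.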

Residual balancing for convergence acceleration
A proper selection of the augmentation parameter ρ has a dramatic effect on the convergence of the ADMM algorithm. Large values of ρ enforce consensus more quickly, yielding smaller primal residuals but larger dual ones; conversely, smaller values of ρ lead to larger primal and smaller dual residuals. Since ADMM terminates when both the primal and dual residuals are small enough, in practice we prefer an augmentation parameter ρ that is neither too large nor too small, in order to balance the reduction of the primal and dual residuals. A commonly used heuristic for making this trade-off is residual balancing [39], in which the penalty parameter is adapted by the rule: set ρ ← τρ if ‖r_p‖ > µ‖r_d‖, set ρ ← ρ/τ if ‖r_d‖ > µ‖r_p‖, and leave ρ unchanged otherwise, where µ, τ > 1 are given parameters. In our experiments, we set τ = 2, µ = 10 and found this rule effective in speeding up practical convergence, as demonstrated numerically in Section 4.5.
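This heuristic is a one-line update per iteration; a minimal sketch (function name ours), following the standard residual-balancing rule of [39]:

```python
def residual_balance(rho, r_p, r_d, tau=2.0, mu=10.0):
    """Residual balancing [39]: grow rho when the primal residual norm r_p
    dominates the dual residual norm r_d, shrink it in the opposite case."""
    if r_p > mu * r_d:
        return rho * tau
    if r_d > mu * r_p:
        return rho / tau
    return rho
```

Note that when ρ changes, the scaled dual variables must be rescaled accordingly.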

Connection to Lagrangian-based methods
In this section, we consider verification of the feed-forward neural network (3) and draw connections to two related methods relying on Lagrangian relaxation. Specifically, we relate our approach to an earlier dual method [13] (Section 3.1), as well as a recent Lagrangian decomposition method [14] (Section 3.2). Overall, these approaches use a similar Lagrangian formulation, but the specific choices in splitting and augmentation of the Lagrangian result in slower theoretical convergence guarantees when solving the convex relaxation to optimality.

Dual method via Lagrangian relaxation of the nonconvex problem
Instead of splitting the neural network equations (4) with auxiliary variables, an alternative strategy is to directly relax (4) with Lagrange multipliers [13], where x = (x_0, ..., x_ℓ) and λ = (λ_0, ..., λ_{ℓ−1}). This results in the dual problem g* = maximize_λ g(λ), with the dual function g as defined below. By weak duality, g* ≤ J*. The inner minimization problems used to compute the dual function g(λ) for a given λ can often be solved efficiently or even in closed form [13]. The resulting dual problem is unconstrained but non-differentiable; hence it is solved using the dual subgradient method [13]. However, subgradient methods are known to be very slow, with convergence rate O(1/√N), where N is the number of updates [40], making it inefficient to find exact solutions of the convex relaxation. On the other hand, this method can be stopped at any time to obtain a valid lower bound.

Lagrangian method via a non-separable splitting
Another related approach is the Lagrangian decomposition method of [14]. To decouple the constraints in the convex relaxation of (4), this approach can be viewed as introducing one set of intermediate variables y_k as copies of x_k, with x_0 ∈ X. This splitting is in the spirit of the splitting introduced in [14,34] and differs from ours, which uses two sets of variables in (6). By relaxing the consensus constraints y_k = x_k, the Lagrangian is obtained; it is again separable, and its minimization yields the dual function. Since the dual function is not differentiable, it must be maximized by a subgradient method, which again has an O(1/√N) rate. To induce differentiability in the dual function and improve speed, [14] uses an augmented Lagrangian. However, since only one set of variables was introduced in (16), the augmented Lagrangian is no longer separable across the primal variables; therefore, for each update of the dual variable, the augmented Lagrangian must be minimized iteratively. To this end, [14] uses the Frank-Wolfe algorithm in a block-coordinate fashion as an iterative subroutine. This slows down overall convergence and suffers from compounding errors when the sub-problems are not fully solved. When stopping early, the primal minimization must be solved to convergence in order to compute the dual function and produce a valid bound.
In contrast to the approach described above, in this paper we use a different variable splitting scheme (6) that fully separates the layers of the neural network. This subtle difference has a significant impact: we can efficiently minimize the corresponding augmented Lagrangian in closed form, without resorting to any iterative subroutine. Specifically, we use the ADMM algorithm, which is known to converge at an O(1/N) rate [41]. In summary, our method enjoys an order-of-magnitude faster theoretical convergence, is more robust to numerical errors, and has minimal parameter-tuning requirements. We remark that during the ADMM updates, the objective value is not necessarily a lower bound on J*, and hence one must run the algorithm to convergence to produce such a bound. To stop early, we can use a strategy similar to the Frank-Wolfe approach of [14] and run the primal iteration to convergence with fixed dual variables in order to evaluate the dual function, which is a lower bound on J*.

Experiments
The strengths of our method are (a) its ability to solve LP relaxations exactly and (b) to do so at scale. To evaluate this, we first demonstrate how solving the LP to optimality leads to tighter certified robustness guarantees in image classification and reinforcement learning tasks (Section 4.1) compared with scalable fast linear bounds methods. We then stress-test our method in both speed and scalability against a commercial LP solver (Section 4.2) and in the large-network setting, e.g., solving the LP relaxation for a standard ResNet18 (Section 4.3). In Section 4.4, we apply our method to reachability analysis of neural network dynamical systems and compare with the state-of-the-art complete verification method α,β-CROWN [17]. Section 4.5 demonstrates the effectiveness of residual balancing in our numerical examples.
Setup In all the experiments, we focus on the setting of verification-agnostic networks, i.e., networks trained without promoting verification performance, similar to [25]. All the networks have been trained adversarially with the projected gradient descent (PGD) attack [42].
In all of the test accuracy certification results reported in this paper, we initialize ρ = 1.0 and apply residual balancing when running ADMM. In different experiments, the stopping criterion parameters ε_abs, ε_rel of ADMM are chosen by trial and error to balance the accuracy and runtime of the algorithm. In all these experiments the objective functions are linear [7]. Full details about the network architectures and training can be found in Appendix A.4.

Improved bounds from exact LP solutions
We first demonstrate how solving the LP exactly with our method results in tighter bounds than prior work that solves convex relaxations in a scalable manner. We consider two main settings: certifying the robustness of classifiers for CIFAR10 and of deep Q-networks (DQNs) in Atari games.
In both experiments, the pre-activation bounds for formulating the LP verification problem are obtained by the fast linear bounds developed in [7].

CIFAR10
We consider a convolutional neural network (CNN) with around 60k hidden units whose convex relaxations cannot feasibly be solved by Gurobi (for the LP relaxation) or SDP solvers (for the SDP relaxation). Up until this point, the only options for networks of this size were linear bounds [20,21] and Lagrangian-based specialized solvers [13,14] for the LP relaxation.
In Table 1, we report the certified accuracy of solving the LP exactly with ADMM in comparison to a range of baselines. Verification of the CNN under ℓ_∞ input perturbations with different radii ε is conducted on the 10,000 test images of CIFAR10, and certified test accuracy is reported as the percentage of test images verified robust by each method. We compare with methods that have previously demonstrated the ability to bound networks of this size: fast bounds of the LP (Linear) [20,21] and interval bounds (IBP) [43]. We additionally compare to a suite of Lagrangian-based baselines, whose effectiveness at this scale was previously unknown. These methods include supergradient ascent (Adam) [34], dual supergradient ascent (Dual Adam) [13] and a variant thereof (Dual Decomp Adam) [34], and a proximal method (Prox) [14]. As mentioned in Section 3.2, these baselines require solving an inner optimization problem within the iterations of the algorithms. In the experiments of Table 1, the numbers of iterations of the different algorithms are bounded separately such that each Lagrangian-based method takes an average runtime of 9 seconds to verify one example. Our ADMM solver also averages 9 seconds per example in this verification task, the same as the average runtime of the Lagrangian-based methods, with the stopping criterion ε_abs = 10^{−4}, ε_rel = 10^{−3} and ρ initialized to 1.0.

Table 1: Certified test accuracy (%) of PGD-trained models on CIFAR10 through ADMM, the Lagrangian decomposition methods [13,34], and fast dual/linear [20,21] or interval bounds [43]. All the LP-based methods (ADMM and Lagrangian) are given the same verification time budget of 9s per example. The fast linear and IBP bounds can be obtained almost instantly in this experiment, and their achievable certified test accuracy is provided for reference.
In Table 1, we find that solving the LP exactly leads to consistent gains in certified robustness for large networks, with up to 2% additional certified accuracy over the best-performing alternative. All the methods in Table 1 are given the same time budget. Indeed, the better theoretical convergence guarantees of ADMM translate to better results in practice: when given a similar budget, the Lagrangian baselines converge more slowly and cannot verify as many examples.

Remark 1 In Table 1, our goal is to compare the performance of ADMM and the other Lagrangian-based methods in solving the same LP-based neural network verification problem. The certified test accuracy can be further improved if tighter convex relaxations or BaB techniques are applied. In fact, for the verification problem considered in Table 1, the state-of-the-art neural network verification method α,β-CROWN [17], which integrates fast linear bounds methods into BaB, achieves certified test accuracies of 65.4%, 56.0%, 37.9%, and 19.9% for ε = 1/255, 1.5/255, 2/255, 2.5/255, respectively, given the same 9s time budget per example. However, as will be shown in Section 4.4, our method outperforms α,β-CROWN in reachability analysis of a neural network dynamical system, which highlights the importance of adapting neural network verification tools to tasks of different structures and scales.
State-robust RL We demonstrate our approach on a non-classification benchmark from reinforcement learning: verifying the robustness of deep Q-networks (DQNs) to adversarial state perturbations [16]. Specifically, we verify whether a learned DQN policy outputs stable actions in the discrete space when given perturbed states. Similar to the large network considered in the CIFAR10 setting, this benchmark has only been demonstrably verified with fast but loose linear bounds-based methods [21].
We consider three Atari game benchmarks: BankHeist, Roadrunner, and Pong, and verify pre-trained DQNs trained with PGD-adversarial training [16]. For each benchmark, we verify the robustness of the DQN over 10,000 randomly sampled frames as our test dataset.
Similar to the CIFAR10 setting, we observe consistent improvement in certified robustness of the DQN when solving the LP exactly with ADMM across multiple RL settings. We summarize the results using our method and the linear bounds on LP relaxations [20,21] in Table 2.

Speed
We compare the speed of our method with a state-of-the-art solver for convex relaxations: the commercial-grade LP solver Gurobi. Since Gurobi cannot handle large networks, we benchmark the approaches on a fully connected network that Gurobi can handle: an MNIST network with architecture 784−600−400−200−100−10 and ReLU activations (see Appendix A.4 for details).
To demonstrate the effectiveness of GPU acceleration in the DeepSplit algorithm, we compare the runtimes of DeepSplit and Gurobi in solving LP relaxations that bound the output range of the MNIST network under ℓ_∞ input perturbations. Specifically, for a given example in the MNIST test dataset, we apply DeepSplit/Gurobi layer by layer to find the tightest pre-activation bounds under the LP relaxation.
With the Gurobi solver, we need to solve 2 × 600 LPs sequentially to obtain the lower and upper bounds for the first activation layer, 2 × 400 LPs for the second activation layer, and so forth. With DeepSplit, the pre-activation bounds can be computed in batch, which allows GPU acceleration.
In our experiment, we fix the radius of the ℓ_∞ input perturbation at ε = 0.02. For the Gurobi solver, we randomly choose 10 samples from the test dataset and compute the pre-activation bounds layer by layer. The LP relaxation is formulated in CVXPY and solved by Gurobi v9.1 on an Intel Core i7-6700K CPU, which has 4 cores and 8 threads. For each example, the total solver time of Gurobi is recorded, with an average of 275.9 seconds. For the DeepSplit method, we compute the pre-activation bounds layer by layer on 19 randomly chosen examples. The algorithm applies residual balancing with initial ρ = 1.0, and the stopping criterion is ε_abs = 10^{−4}, ε_rel = 10^{−3}. The total running time of DeepSplit is 717.9 seconds, i.e., 37.8 seconds per example on average. With GPU acceleration, our method thus achieves roughly a 7x speedup over the commercial-grade Gurobi solver in verifying NN properties.

Scalability
To test the scalability and generality of our approach, we consider solving the LP relaxation for a ResNet18, which up to this point has simply not been possible due to its size. The only applicable baseline here is LiRPA [21], a highly scalable implementation of the linear bounds that works for arbitrary networks but can be quite loose in practice. For this experiment, we measure the improvement in the bound from solving the LP exactly in comparison to LiRPA.
The ResNet18 network is trained on CIFAR10, with its max pooling layer replaced by a downsampling convolutional layer for comparison with LiRPA [21], which computes provable linear bounds for the outputs of general neural networks and is so far the only other method that can handle ResNet18. The network is adversarially trained using the fast adversarial training code from [44].
In our experiments, for the first 100 test examples of CIFAR10, we use LiRPA to compute the pre-activation bounds for each ReLU layer in ResNet18 and then apply ADMM to compute the lower and upper bounds on the ResNet18 outputs (there are 10 outputs corresponding to the 10 classes of the dataset). ADMM is run with stopping criterion ε_abs = 10^{−5}, ε_rel = 10^{−4}. Under ℓ_∞ input perturbations of radius ε = 1/255, we find that exact LP solutions with our ADMM solver produce substantial improvements in the bounds at ResNet18 scale, as shown in Figure 4. For a substantial number of examples, ADMM finds significantly tighter bounds (especially lower bounds).

Reachability analysis of dynamical systems
We consider over-approximating the reachable sets of a discrete-time neural network dynamical system x(t + 1) = f_NN(x(t)) over a finite horizon, where x(t) ∈ R^{n_x} denotes the state at time t = 0, 1, ..., f_NN is a feed-forward neural network, and n_x is the dimension of the system. Specifically, we consider a cart-pole system with 4 states under nonlinear model predictive control [45], and train a 4−100−100−4 neural network f_NN(x) with ReLU activations to approximate the closed-loop dynamics with sampling time 0.05 s. The 4 states x = [x_1 x_2 x_3 x_4]ᵀ of the cart-pole system represent the position and velocity of the cart, and the angle and angular speed of the pendulum, respectively. Given an initial set X_0 ⊂ R^4 such that x(0) ∈ X_0, we want to over-approximate the reachable set of x(t) = f_NN^(t)(x(0)), where f_NN^(t) denotes the t-th order composition of f_NN and is itself a feed-forward neural network.
Over-approximating x(t) can be formulated as a neural network verification problem (1), where the network is given by the sequential concatenation of t copies of f_NN and the input set X is chosen as the initial set X_0. A box over-approximation of x(t) can be obtained by minimizing/maximizing the i-th output of the composed network with DeepSplit for each coordinate i.
We observe that DeepSplit gives a reasonable over-approximation of the reachable set of x(20), while the bounds obtained by α,β-CROWN in this task are an order of magnitude more conservative. This comparison also holds for other randomly chosen initial sets X_0. The details of the experimental setup are given in Appendix A.4.
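For intuition on the box reachability formulation, the sketch below implements only the much coarser interval bound propagation (IBP, cf. [43]) through t copies of a small ReLU network; DeepSplit instead solves the LP relaxation of the composed network, which yields tighter boxes. The toy weights and function names here are illustrative, not the cart-pole network.

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Interval propagation through an affine map: track box center and radius."""
    c, r = (lo + hi) / 2.0, (hi - lo) / 2.0
    c2 = W @ c + b
    r2 = np.abs(W) @ r
    return c2 - r2, c2 + r2

def reachable_box(layers, lo, hi, t):
    """Coarse box over-approximation of x(t): push the initial box [lo, hi]
    through t copies of the network given as (W, b, apply_relu) triples."""
    for _ in range(t):
        for W, b, apply_relu in layers:
            lo, hi = ibp_affine(lo, hi, W, b)
            if apply_relu:
                lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi
```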

Effects of residual balancing
We demonstrate the effects of residual balancing on the convergence of ADMM using the MNIST network (see Appendix A.4 for details). We conduct our experiment on the 1938-th example, randomly chosen from the MNIST test dataset; for this example, the MNIST network predicts the class (digit 4) correctly. We add an ℓ_∞ perturbation of radius ε = 0.02 to the input image and verify whether the network output is robust with respect to class 3, which corresponds to a linear objective comparing the logits of the true class 4 and the target class 3. The maximum number of iterations is restricted to 3000. The objective values and the primal and dual residuals of ADMM under different fixed augmentation parameters ρ are plotted in Figure 5. Residual balancing in this experiment is applied with τ = 2, µ = 10, and ρ initialized to 10.0.
The effects of ρ on the convergence rates of the primal and dual residuals are illustrated empirically in Figure 5. Despite being initialized at the large value 10.0, residual balancing adapts ρ and achieves a significant improvement in convergence rate compared with the constant choice ρ = 10.0. As observed in our other experiments, with residual balancing ADMM becomes insensitive to the initialization of ρ and usually achieves a good convergence rate.

Conclusion
In this paper, we proposed DeepSplit, a scalable and modular operator splitting technique for solving convex relaxation-based verification problems for neural networks. The method can exactly solve large-scale LP relaxations with GPU acceleration and favorable convergence rates. Our approach leads to tighter bounds across a range of classification and reinforcement learning benchmarks, and can scale to a standard ResNet18. We leave as future work a further investigation of variations of ADMM that can improve convergence rates in deep learning-sized problem instances, as well as extensions beyond the LP setting. Furthermore, it would be interesting to extend the proposed method to verification of recurrent neural networks (RNNs) such as vanilla RNNs, LSTMs, and GRUs [46,47,48].
By monotonicity of the sub-differentials, we can write the following inequalities, where T_f(x) ∈ ∂f(x) denotes a subgradient. By substituting (21) and (22) into (23), we obtain the claimed inequality. By adding the preceding inequalities, we obtain an inequality which, when ρ > 0, implies that A_1x̄_1 + A_2x̄_2 = A_1w̄_1 + A_2w̄_2, and hence the sub-differential ∂g is a singleton.
Convergence. When X is a closed nonempty convex set, I_X(x_0) is a closed convex proper (CCP) function. Assuming that J is also CCP, we can conclude that f_1 is CCP. Furthermore, since the sets S_{φ_k} are nonempty closed convex sets, we can conclude that f_2 is CCP. Under these assumptions, the augmented Lagrangian has a minimizer (not necessarily unique) for each value of the dual variables. Finally, under the assumption that the augmented Lagrangian has a saddle point (which produces a solution to (18)), the ADMM algorithm enjoys primal residual convergence ‖r_p‖_2 → 0 (see (15)), dual residual convergence ‖r_d‖_2 → 0, as well as objective convergence J(x_ℓ) → J* [15]. We remark that the convergence guarantees of ADMM hold even if f_1 and f_2 take the value +∞; this is the case for the indicator functions resulting in the projections in the first two updates of Algorithm 1.

A.2 Projection onto convolutional layers
Although a convolution is a linear operator, it is impractical to directly form the inverse matrix for the projection step of the DeepSplit algorithm (12). Instead, we represent a typical convolutional layer f_conv with stride, padding, and bias as the composition f_conv = f_bias ∘ f_ds ∘ f_crop ∘ f_circ ∘ f_pad, where f_pad is a padding step, f_circ is a circular convolution step, f_crop is a cropping step, f_ds is a down-sampling step to handle stride greater than one, and f_bias is a step that adds the bias. In other words, a convolutional layer can be decomposed into five sequential layers in our neural network representation (3). In practice, we combine the last three steps into one post-processing layer f_post to reduce the number of consensus constraints in the DeepSplit algorithm. The projection steps for all of these operators are presented next. An efficient FFT implementation of the projection step for the affine layer f_circ is given in Appendix A.3.
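To make the decomposition concrete, here is a small 1-D sketch (the sizes, kernel, and stride are illustrative; real layers are 2-D and multi-channel) checking that pad → circular convolution → crop → down-sample → bias reproduces a standard strided convolution:

```python
# 1-D illustration: a strided, padded convolution with bias equals
# f_bias ∘ f_ds ∘ f_crop ∘ f_circ ∘ f_pad (cross-correlation convention).
x = [1.0, 2.0, -1.0, 0.0, 3.0, 1.0, -2.0, 4.0]   # input, n = 8
w = [0.5, -1.0, 2.0]                              # kernel, k = 3
pad, stride, bias = 1, 2, 0.25

# Reference: standard convolutional layer computed directly.
xp_ref = [0.0] * pad + x + [0.0] * pad
out_len = (len(x) + 2 * pad - len(w)) // stride + 1
ref = [sum(w[j] * xp_ref[i * stride + j] for j in range(len(w))) + bias
       for i in range(out_len)]

# Step 1, f_pad: zero-pad the input.
xp = [0.0] * pad + x + [0.0] * pad
N = len(xp)
# Step 2, f_circ: circular convolution on the padded signal.
circ = [sum(w[j] * xp[(i + j) % N] for j in range(len(w))) for i in range(N)]
# Step 3, f_crop: keep only the positions unaffected by wrap-around.
crop = circ[: N - len(w) + 1]
# Step 4, f_ds: down-sample to implement the stride.
ds = crop[::stride]
# Step 5, f_bias: add the bias.
out = [v + bias for v in ds]
```

The cropped positions are exactly those whose receptive field never wraps around, which is why the circular convolution plus crop agrees with the zero-padded "valid" convolution.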

Padding
The padding layer f_pad takes an image as input and adds padding to it. Denote the input image by y_k and the padded image by z_k. We can decompose the output z_k into two vectors: z_k^0, which is a copy of the input y_k, and z_k^1, which represents the padded zeros on the edges of the image. Equivalently, the padding layer z_k = φ_k(y_k) can be written in an affine form, for which the projection operator reduces to the affine case.

Cropping
The cropping layer f_crop crops the output of the circular convolution f_circ to the original size of the input image before padding. Denote by y_k the input image and by z_k the output image of the cropping layer. By decomposing the input image y_k into the uncropped pixels y_k^0 and the cropped pixels y_k^1, the cropping layer z_k = φ_k(y_k) has an affine formulation whose projection operator is given in Section 2.4.

Down-sampling and bias
If the typical convolutional layer f_conv has stride greater than one, a down-sampling layer is added in the DeepSplit algorithm, which has the same affine form as the cropping layer with different values of y_k^0 and y_k^1. Therefore, the projection operator for the down-sampling layer reduces to the affine case as well.
The bias layer in the DeepSplit algorithm handles the case when the convolutional layer f_conv has a bias b_k, and is implemented as z_k = φ_k(y_k) = y_k + b_k. This is an affine expression, and its projection operator is given in Section 2.4.
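As a concrete instance of such an affine projection, the Euclidean projection of a point (a, c) onto the graph {(y, z) : z = y + b} of the bias layer has a simple closed form; the sketch below (scalar values are illustrative) derives and checks it:

```python
# Projection of (a, c) onto the affine set {(y, z) : z = y + b}:
#   minimize (y - a)^2 + (z - c)^2  subject to  z = y + b.
# Substituting z = y + b and setting the derivative to zero gives
#   y* = (a + c - b) / 2,   z* = y* + b.
def project_bias(a, c, b):
    y = (a + c - b) / 2.0
    return y, y + b

a, c, b = 3.0, 1.0, 0.5
y_star, z_star = project_bias(a, c, b)
# Optimality: the residual (y* - a, z* - c) is orthogonal to the
# constraint direction (1, 1), i.e. (y* - a) + (z* - c) = 0.
```

The other affine layers (padding, cropping, down-sampling) admit analogous closed-form projections, since each is a selection/affine map of the input.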

Convolutional post-processing layer
We combine the cropping, down-sampling, and bias layers into one post-processing layer, i.e., f_post = f_bias ∘ f_ds ∘ f_crop, as shown in (25). This reduces the total number of consensus constraints in the DeepSplit algorithm. Since all three layers are affine, the post-processing layer is also affine, and its projection operator can be obtained correspondingly.

A.3 FFT implementation for circular convolutions
In order to efficiently implement projection onto the convolutional layer (25), recall that we can decompose a convolution into three sequential steps: padding f_pad, circular convolution f_circ, and post-processing f_post. We now discuss in detail how to efficiently perform the (y, z) update for multi-channel circular convolutions f_circ using Fourier transforms. We begin with the single-channel setting, and then extend our procedure to the multi-channel setting.
Single-channel circular convolutions
Let U represent the discrete Fourier transform (DFT) as a linear operator, and let W be the weight matrix for the circular convolution f_circ(x) = W * x. Then, using matrix notation, the convolution theorem states that W * x = U^* D U x, where D = diag(UW) is a diagonal matrix containing the Fourier transform of W, and U^* is the conjugate transpose of U. We can then represent the inverse operator from (12) as (I + W^T W)^{-1} = U^* (I + D̄D)^{-1} U. Since I + D̄D is a diagonal matrix, its inverse can be computed by simply inverting the diagonal elements, and requires storage space no larger than the original kernel matrix. Thus, multiplication by the inverse matrix for a circular convolution reduces to two DFTs and an element-wise product.
For an input of size n × n, this step has an overall complexity of O(n² log n) when using fast Fourier transforms.
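A small 1-D numpy sketch of this computation (sizes and kernel are illustrative), checking the FFT-based solve of (I + WᵀW)x = v against a dense circulant solve:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 16, 3
w = np.zeros(N)
w[:k] = rng.standard_normal(k)   # kernel, zero-extended to length N
v = rng.standard_normal(N)

# Dense circulant matrix: (W x)[i] = sum_j w[(i - j) mod N] x[j]
W = np.array([[w[(i - j) % N] for j in range(N)] for i in range(N)])

# Dense solve of (I + W^T W) x = v.
x_dense = np.linalg.solve(np.eye(N) + W.T @ W, v)

# FFT solve: W = U* D U with D = diag(FFT(w)), so
# (I + W^T W)^{-1} = U* (I + conj(D) D)^{-1} U  -- a pointwise division.
d = np.fft.fft(w)
x_fft = np.fft.ifft(np.fft.fft(v) / (1.0 + np.abs(d) ** 2)).real
```

The FFT route never materializes the N × N circulant matrix: it stores only the N Fourier coefficients of the kernel, exactly as claimed above.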

Multi-channel circular convolutions
We now extend the operation for single-channel circular convolutions to the multi-channel setting typically found in the convolutional layers of deep vision classifiers. Specifically, for a circular convolution with n input channels and m output channels, we have f_circ(x)_j = Σ_{i=1}^{n} W_{ij} * x_i, where f_circ(x)_j is the jth output channel of the circular convolution, W_{ij} is the kernel connecting the ith input channel to the jth output channel, and x_i is the ith channel of the input x. The convolution theorem again tells us that U f_circ(x)_j = Σ_{i=1}^{n} D_{ij} U x_i, where D_{ij} = diag(U W_{ij}). This can be re-written more compactly in matrix form as f_circ(x) = Ū_m^* D̂ Ū_n x, where Ū_n is a block-diagonal matrix with n copies of U along the diagonal, Ū_m is a block-diagonal matrix with m copies of U along the diagonal, D̂ is the block matrix whose (j, i) block is the diagonal matrix D_{ij}, and x = (x_1, . . . , x_n) is a vertical stacking of all the input channels.
Then, we can represent the inverse operator from (12) as Ū^* (I + D̂^*D̂)^{-1} Ū, where I + D̂^*D̂ is a block matrix in which each block is a diagonal matrix. The inverse can then be calculated by inverting the sub-matrices formed by sub-indexing the diagonal components. Specifically, let D̂_{j::p} be the slice of D̂ obtained by taking the jth diagonal element of every block; for example, D̂_{0::p} is the matrix obtained by taking the top-left element along the diagonal of every block. Then, for each frequency index j, we have ((I + D̂^*D̂)^{-1})_{j::p} = ((I + D̂^*D̂)_{j::p})^{-1}. Thus, calculating this matrix amounts to inverting a batch of p matrices of size m × m. For typical convolutional networks, m is well below 1,000, and so this batch can be inverted quickly. Further note that this only needs to be calculated once as a pre-computation step, and can be reused across different inputs and objectives for the network.
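A 1-D numpy sketch of the multi-channel pre-computation and solve (sizes are illustrative, and we take the number of input and output channels equal so the per-frequency block size is unambiguous), checking the batched per-frequency solve against a dense block-circulant solve:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, k = 2, 2, 8, 3                        # in/out channels, length, kernel
W = np.zeros((m, n, p))
W[:, :, :k] = rng.standard_normal((m, n, k))   # kernels, zero-extended

# Dense block-circulant operator C: (C x)_j = sum_i W[j, i] circ-conv x_i.
def circulant(c):
    return np.array([[c[(i - j) % p] for j in range(p)] for i in range(p)])

C = np.block([[circulant(W[j, i]) for i in range(n)] for j in range(m)])
v = rng.standard_normal(n * p)
x_dense = np.linalg.solve(np.eye(n * p) + C.T @ C, v)

# FFT route: per frequency f, the operator is the m x n matrix A_f with
# A_f[j, i] = FFT(W[j, i])[f]; solving (I + C^T C) x = v reduces to the
# batched per-frequency systems (I + A_f^H A_f) X_f = V_f.
A = np.fft.fft(W, axis=-1).transpose(2, 0, 1)          # shape (p, m, n)
V = np.fft.fft(v.reshape(n, p), axis=-1).T[..., None]  # shape (p, n, 1)
M = np.eye(n) + np.conj(A).transpose(0, 2, 1) @ A      # p blocks, pre-computable
X = np.linalg.solve(M, V)                              # batched solve
x_fft = np.fft.ifft(X[..., 0].T, axis=-1).real.reshape(-1)
```

The batch of small per-frequency matrices M (and its factorization or inverse) is the quantity that can be pre-computed once and reused across inputs.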

Memory and runtime requirements
In practice, we do not store the fully-expanded block-diagonal matrices; instead, we omit the zero entries and directly store the diagonal entries themselves. Consequently, for an input of size p, the diagonal matrices require storage of size O(mnp), and the inverse matrix requires storage of size O(m²p). Since the discrete Fourier transform can be done in O(p log p) time with fast Fourier transforms, the overall runtime of the pre-computation step to form the matrix inverse is the cost of the initial DFT and the batch matrix inverse, or O(nmp log p + m³p). Finally, the runtime of the projection step is O((n + m)p log p + n²mp), corresponding to the respective costs of the DFT transformations Ū and Ū^*, as well as the multiplication by D̂.
Since the number of channels in a deep network is typically much smaller than the size of the input to a convolution (i.e., n < p and m < p), the costs of performing the cyclic convolution with Fourier transforms are in line with typical deep learning architectures.

A.4 Experimental details
CIFAR10
For CIFAR10, we use the large convolutional architecture from [20], which consists of four convolutional layers with 32 − 32 − 64 − 64 channels, strides 1 − 2 − 1 − 2, kernel sizes 3 − 4 − 3 − 4, and padding 1 − 1 − 1 − 1, followed by three linear layers of size 512 − 512 − 10. This is significantly larger than the CIFAR10 architectures considered by [25], and has sufficient capacity to reach 43% adversarial accuracy against an ℓ∞ PGD adversary at ε = 8/255. The model is trained against a PGD adversary with 7 steps of size α = 2/255 at a maximum radius of ε = 8.8/255, with batch size 128 for 200 epochs. We used the SGD optimizer with a cyclic learning rate (maximum learning rate of 0.23), momentum 0.9, and weight decay 0.0005. The model achieves a clean test accuracy of 71.8%.
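The layer sizes above can be sanity-checked with the standard output-size formula ⌊(n + 2·pad − k)/stride⌋ + 1; a quick stdlib sketch (only the architecture numbers come from the text):

```python
# Trace spatial sizes through the four convolutional layers described above.
def conv_out(n, k, s, p):
    return (n + 2 * p - k) // s + 1

size, channels = 32, 3   # CIFAR10 input: 3 x 32 x 32
layers = [  # (out_channels, kernel, stride, padding)
    (32, 3, 1, 1), (32, 4, 2, 1), (64, 3, 1, 1), (64, 4, 2, 1),
]
for c, k, s, p in layers:
    size, channels = conv_out(size, k, s, p), c

flattened = channels * size * size   # input size of the first linear layer
```

The two stride-2 layers each halve the spatial resolution, so the convolutional trunk ends at 64 × 8 × 8 before the linear layers.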

State-robust RL
We use the pretrained, adversarially trained DQNs released by [16]. These models were trained to be robust at ε = 1/255 with a PGD adversary for the Atari games Pong, Roadrunner, Freeway, and BankHeist. Each input to the DQN is of size 1 × 84 × 84, which is more than double the size of a CIFAR10 input. The DQN architectures are convolutional networks with three convolutional layers with 32 − 64 − 64 channels, kernel sizes 8 − 4 − 3, strides 4 − 2 − 1, and no padding, followed by two linear layers of size 512 − K, where K is the number of discrete actions available in each game.
MNIST
For MNIST, we consider a fully connected network with layer sizes 784 − 600 − 400 − 200 − 100 − 10 and ReLU activations. It is more than triple the size of the network considered by [25], with one additional layer. It is, however, still small enough that Gurobi is able to solve the LP relaxation, which allows us to compare our running time against Gurobi. We train the network with an ℓ∞ PGD adversary at radius ε = 0.1, using 7 steps of size α = 0.02, with batch size 100 for 100 epochs. We use the Adam optimizer with a cyclic learning rate (maximum learning rate of 0.005), and both models achieve a clean accuracy of 99%.

ResNet18
To highlight the scalability of our approach, we consider a ResNet18 trained on CIFAR10 whose max-pooling layer is replaced by a down-sampling convolutional layer, for comparison with LiRPA [21], which is capable of computing provable linear bounds for the outputs of general neural networks and is the only other method available so far that can handle ResNet18. The ResNet18 is adversarially trained using the fast adversarial training code from [44].
Reachability analysis example
Following the notation from Section 4.4, the neural network dynamics f_NN has the architecture 4 − 100 − 100 − 4 with ReLU activations and is trained on 10000 randomly collected state-transition samples of the closed-loop dynamics over a bounded domain in the state space. With DeepSplit, we solve the LP-based verification problem (7) on truncated neural networks to find all the pre-activation bounds in f_NN^(20), while α,β-CROWN is run directly on f_NN^(20) since it searches for pre-activation bounds automatically. Note that the same time budget of 1470 s per bound is assigned to both methods. ADMM is run with stopping criteria ε_abs = 10^(−5) and ε_rel = 10^(−4), while α,β-CROWN is run using its default configuration parameters.