On the Efficiency of Supernodal Factorization in Interior-Point Method Using CPU-GPU Collaboration

The primal-dual interior-point method (PDIPM) is the most efficient technique for solving sparse linear programming (LP) problems. Despite its efficiency, PDIPM remains a compute-intensive algorithm. Fortunately, graphics processing units (GPUs) have the potential to supply the required computational power. However, their peculiar architecture entails a positive relationship between problem density and speedup, conversely implying a limited affinity of GPUs for problem sparsity. To overcome this difficulty, the state-of-the-art hybrid (CPU-GPU) implementation of PDIPM exploits the presence of supernodes in sparse matrices during factorization. Supernodes are groups of similar columns that can be treated as dense submatrices. The factorization method used in the state-of-the-art solver performs only selected operations related to large supernodes on GPU. This method is known to underutilize the GPU's computational power while increasing CPU-GPU communication overhead. These shortcomings encouraged us to adapt another factorization method, which processes sets of related supernodes on GPU, and introduce it into the PDIPM implementation of a popular open-source solver. Our adaptation enabled the factorization method to better mitigate the effects of round-off errors accumulated over multiple iterations of PDIPM. To augment the performance gains, we also used an efficient CPU-based matrix multiplication method. When tested on a set of well-known sparse problems, the adapted solver showed average speed-ups of approximately 55X, 1.14X and 1.05X over the open-source solver's original version, the state-of-the-art solver, and a highly optimized proprietary solver known as CPLEX, respectively. These results strongly indicate that our proposed hybrid approach can lead to significant performance gains for solving large sparse problems.


I. INTRODUCTION
Linear programming (LP) is the simplest optimization technique, which is widely used in fields as diverse as scientific research, economics and industry [1]. LP problem solvers also serve as drivers for solving nonlinear and mixed-integer linear programming problems [2], which are frequently encountered in highly active research areas like artificial intelligence, cloud computing and cryptography [3]- [12]. Fundamentally, there are two types of methods used for solving large LP problems, namely simplex method and interior-point method (IPM) [13]. Algorithms based on simplex method remain popular for solving LP problems [14].
(The associate editor coordinating the review of this manuscript and approving it for publication was Yan-Jun Liu.)

However, the most widely used variant of IPM, called the primal-dual interior-point method (PDIPM) [13], is generally preferred for solving the more commonly occurring sparse problems [15].
The process of solving a large LP problem using PDIPM involves compute-intensive operations like matrix-matrix multiplication, matrix-vector multiplication and solving systems of normal equations. Cost-effective availability of immense computational power in graphics processing units (GPUs) has led to their widespread use as general-purpose processors for efficiently performing compute-intensive operations [16]. This general-purpose computing on GPU (GPGPU) approach has also been followed to parallelize the compute-intensive operations in dense versions of both simplex method and PDIPM [17], [18].
It is important to highlight that the memory organization of GPUs suits efficient execution of dense linear algebra operations better than that of their sparse counterparts [19], which apparently diminishes the utility of these devices for solving sparse problems using PDIPM. Nevertheless, there are some GPU-friendly operations even in sparse implementations of PDIPM. These operations have already been leveraged to develop GPU implementations of PDIPM [20]-[22].
PDIPM is an iterative algorithm. During each iteration, it solves a system of normal equations defined by a symmetric positive-definite (SPD) matrix twice for different right-hand side (RHS) vectors [21]. Conjugate-gradient method (CGM) and Cholesky factorization (CF) are the two main methods used for solving systems of normal equations. CGM follows an iterative approach toward finding an approximation of the solution, where the number of iterations is positively related to the accuracy of the solution returned. This method becomes an effective option in case accuracy requirements for the solutions returned can be relaxed [21]. The entire CGM needs to be repeated even for the same SPD matrix if the RHS vector is changed. Its potential for GPU implementation in the context of solving sparse systems of equations lies in its need to repeatedly perform matrix-vector multiplications. However, its inability to return exact solutions may adversely affect the ability of PDIPM to converge quickly [22]. As opposed to CGM, CF can be used to obtain exact solutions. It decomposes SPD matrices into triangular factors. These factors can subsequently be used to find the solutions for different RHS vectors [23]. This ability to reuse the factors, along with higher accuracy of the solutions returned, makes CF a better choice for implementing PDIPM.
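The reuse of triangular factors for different RHS vectors can be made concrete with a short sketch. The function below (our own dense illustration with row-major storage, not code from any of the solvers discussed) solves S x = b by forward and back substitution given a previously computed Cholesky factor L of S; the factor is computed once and then reused for every new b.

```c
#include <stddef.h>

/* Solve S x = b given the Cholesky factor L of S = L * L^T.
 * L is a dense n-by-n lower-triangular matrix in row-major order.
 * Computing L is O(n^3); each solve below is only O(n^2), which is
 * what makes reusing the factor for multiple RHS vectors attractive. */
void cholesky_solve(size_t n, const double *L, const double *b, double *x)
{
    /* Forward substitution: L y = b (y stored in x). */
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < i; j++)
            s -= L[i * n + j] * x[j];
        x[i] = s / L[i * n + i];
    }
    /* Back substitution: L^T x = y. */
    for (size_t i = n; i-- > 0; ) {
        double s = x[i];
        for (size_t j = i + 1; j < n; j++)
            s -= L[j * n + i] * x[j];
        x[i] = s / L[i * n + i];
    }
}
```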
Unfortunately, the potential of sparse CF for efficient GPU implementation is not immediately apparent. Nevertheless, it has previously been discovered that the presence of structures called supernodes in many real-world SPD matrices can be leveraged for an efficient GPU implementation of sparse CF. A supernode is a set of columns with a (nearly) identical nonzero pattern, which can be treated as a dense submatrix. The established supernodal CF method, subsequently referred to as root CF (RCF), has purely-CPU as well as hybrid (CPU-GPU) implementations [24], [25]. Hybrid RCF performs only two of the three so-called block operations involved in the processing of large supernodes on GPU. This strategy of performing only some operations related to large supernodes on GPU, while executing all other operations on CPU, increases delays in launching GPU routines as well as CPU-GPU communication overhead. To reduce these losses in efficiency, Rennich et al. [19] later proposed another supernodal CF algorithm, subsequently referred to as hybrid subtree CF (SCF). It transfers a set, consisting of a large supernode and all its related supernodes, to GPU as a batch. All the block operations on each of these supernodes are performed on GPU, while calls to some GPU routines processing smaller supernodes are further batched. This reduces the waiting times involved in the processing of supernodes on GPU while improving its utilization. In addition, hybrid SCF uses GPU streams to overlap the execution of some operations.
During our literature review, we came across a hybrid implementation of PDIPM reported as early as 2008 by Jung and O'Leary [18]. However, their implementation focused only on dense LP problems. Smith et al. [20] later developed GPU-friendly sparse matrix-vector multiplication (SPMV) kernels to accelerate the process of solving systems of normal equations using CGM. Gade-Nielsen et al. [21] also contributed towards the development of hybrid implementations of PDIPM. They focused on a specific class of problems arising in the field of model predictive control. They tried to use the hybrid RCF implementation of a popular library known as CHOLMOD [24], [26], but admitted that they were unable to get it working, which forced them to exclusively use CGM for solving systems of normal equations. More recently, Maggioni [22] developed a hybrid implementation of PDIPM for sparse real-world problems. His solver used a combination of CHOLMOD's hybrid RCF implementation and CGM for solving systems of normal equations. CGM was used during the initial iterations, whereas CHOLMOD's hybrid RCF was used during later iterations, when the accuracy of the normal equations' solution is critical for converging to the LP problem's solution. Maggioni also developed GPU-based SPMV kernels to accelerate CGM.
Our survey of existing hybrid implementations of PDIPM revealed that hybrid SCF proposed by Rennich et al., available in CHOLMOD beta v. 4.6.0 since 2016 [26], has not been used previously. Existing solvers either use variants of CGM exclusively or use them in combination with hybrid RCF implementation of CHOLMOD's release version. This encouraged us to adapt PDIPM implementation of a popular open-source solver, known as GNU linear programming kit (GLPK) [15], [27], by replacing its simple (non-supernodal) CF with CHOLMOD beta version's hybrid SCF. We made changes to the new CF algorithm that made it compatible with GLPK's PDIPM implementation and ensured that accumulation of round-off errors did not cause premature termination of PDIPM. The adapted CF version was also numerically more stable. We also replaced calls to GLPK's original sparse matrix-matrix multiplication (SPMM) routine with calls to the corresponding routine available in CHOLMOD [28], which exploits the sparsity of input matrices more extensively to reduce the amount of processing required.
The introduction of our modified version of hybrid SCF and calls to the new SPMM routine resulted in significant performance gains for the adapted solver. During our experiments, we used the same set of test problems that was used previously to test the performance of the state-of-the-art solver. Our results show that the proposed solver achieved average speed-ups of around 55.458X, 1.136X and 1.053X over GLPK's original version, the state-of-the-art solver and IBM's highly optimized solver known as CPLEX [29], respectively.
The rest of this paper is organized as follows. In Section II, we introduce CF and the process of solving LP problems using PDIPM. In Section III, we discuss supernodal CF variants used in different solvers considered in this paper. In Section IV, we provide implementation details related to our proposed solver. In Section V, we describe the experimental setup. In Section VI, we discuss performance gains achieved by our proposed solver. We conclude the paper in Section VII by summarizing our work and possibilities for its future extension.

II. BACKGROUND

A. CF BASICS
CF is used to decompose an SPD matrix as a product LL^T, where L and L^T are lower and upper triangular matrices, respectively [23]. Consequently, a single matrix L is sufficient to describe both factors. Algorithm 1 shows one of the simplest forms of CF. This algorithm runs in O(n^3) time, but this high cost can be reduced by exploiting the symmetric nature of SPD matrices. Moreover, if the same SPD matrix is used to solve for more than one RHS vector, then the lower time complexity (O(n^2)) of these solutions further compensates for the high cost of factorization [22]. Further details related to CF are provided later in Section III.

B. SOLVING LP PROBLEMS USING PDIPM
LP involves optimization, i.e. minimization or maximization, of the value of a given linear function, called the objective function, subject to linear constraints [30]. Any LP problem can be brought into a standard minimization form, having only equality constraints, by scaling the objective function and adding appropriate slack, surplus and artificial variables to its constraints. A minimization problem in standard form, having n variables and m constraints, can be mathematically represented as follows [15], [17].

minimize    z = c^T x + c_0
subject to  Ax = b,  x ≥ 0

where x represents the n variables in the objective function and constraints; c contains the coefficients of the variables in the objective function; c_0 is a constant representing a shift in the value of the objective function; b is the m-dimensional vector of constraint right-hand sides; and A_{m×n} = [a_ij] is the constraint matrix, where a_ij, i ∈ {1, 2, ..., m}, j ∈ {1, 2, ..., n}, represents the coefficient of the j-th variable in the i-th constraint.
The formulation given above shows an LP problem represented in its primal form. An alternative form also exists, called the dual form, which is obtained by transposing the constraint matrix of the primal problem. Some algorithms, called primal-dual algorithms, simultaneously solve both primal and dual forms of an LP problem to get quicker convergence to the solution [15].
Since its development in 1953, the simplex method was the only algorithm available for solving LP problems. Work during the late 1970s and early 1980s led to the development of a new class of algorithms called IPMs. These algorithms have a polynomial worst-case time complexity, which is better than the exponential worst-case complexity of the simplex method [31]. Conceptually, an IPM begins from a point inside the region bounded by the constraints and moves towards the optimal value (vertex) in a stepwise manner, adjusting the magnitude and direction of the movement at each step [13]. The most widely used variation of IPM, called PDIPM, simultaneously solves both primal and dual versions of the problem for quicker convergence [13]. In this paper, we consider PDIPM as implemented in GLPK, which is based on Mehrotra's predictor-corrector method [32].
GLPK's PDIPM implementation needs to solve a system of normal equations described by the SPD matrix S_{m×m} = ADA^T twice during each iteration, where D is an n × n diagonal matrix that changes from one iteration to another. In addition, the same system of normal equations, for D = I_{n×n}, needs to be solved twice during initialization. In an implementation of PDIPM for sparse problems, the nonzero pattern of the matrix S remains constant, which can be used to save the cost of computing its memory requirements during each iteration. During a run of PDIPM over N iterations, the system of normal equations needs to be solved 2(N + 1) times, whereas the product S = ADA^T and the factorization of S need to be computed N + 1 times. It is shown in Section IV that these two operations constitute the bulk of the overall computational cost of PDIPM.

III. SUPERNODAL CF
We begin this section by introducing supernodal CF. Afterwards, we describe the supernodal CF variants, CPU-based RCF, hybrid RCF and hybrid SCF, as implemented in different versions of CHOLMOD.

A. BASICS
CF, as shown previously in Algorithm 1, is called a left-looking algorithm because it moves from left to right in a column-wise manner, factorizing the input matrix lying toward the left of each column. This algorithm provides limited acceleration opportunities, found only in the operations shown in Line 4 and Line 9. However, the efficiency of this algorithm for factorizing large matrices can be greatly enhanced by traversing the input matrix in a block-wise manner [22]. Algorithm 2 shows one such method, where a fixed block size is used to process blocks of consecutive columns in a single iteration [23]. In Algorithm 2, three operations, referred to as block operations, are performed iteratively. In the following paragraphs, we briefly describe these operations.
1. update: This operation updates the current block of the input submatrix using the previously computed rows of the factor, lying to the left of the current block. The operation update(i, j, k) represents the update operation applied to the j-th row in the i-th block of the input matrix using a product of the i-th and j-th rows in the previously computed k-th block of the factor. This operation is shown in Line 4 of Algorithm 2.
2. factorize: This operation uses Algorithm 1 to factorize a relatively small square submatrix containing the diagonal of the current block. The operation factorize(i) represents the factorization of the input submatrix containing the diagonal of the i-th block. This is shown in Line 7 of Algorithm 2.
3. solve: This operation computes the factorization of the i-th block by using the results of the update and factorize operations. The operation solve(i, j) denotes the factorization of the j-th row of the i-th block, which is performed by using the results of the update and factorize operations performed earlier for the i-th block. Line 12 of Algorithm 2 shows this operation.

Hogg [23], in his detailed survey of factorization techniques for symmetric matrices, described two types of acceleration opportunities available while implementing Algorithm 2. First, the block-wise approach of Algorithm 2 can achieve high levels of performance by using cache-friendly linear algebra operations available in highly efficient versions of the basic linear algebra subprograms (BLAS) and linear algebra package (LAPACK) libraries [33], [34]. It is important to mention that GPU implementations of these libraries can also be used to achieve high degrees of fine-grained parallelization for processing large dense matrices. Second, Hogg suggested that further performance gains, in terms of coarse-grained parallelization, can be achieved by overlapping the execution of block operations that are not dependent on each other. Table 1 shows the mutual dependencies of different block operations.
It is customary for any implementation of CF to perform a preliminary analysis phase to build an elimination tree representing dependencies of all its block operations [23], [24]. The elimination tree can be used to find the dependency of a block operation on the results obtained by the processing of other blocks as follows. (1) If two blocks lie on different branches of the elimination tree then there is no dependency between them. (2) The processing of an ancestor node depends on the results obtained by processing of its descendants. Workflow represented as an elimination tree helps in determining parallelization possibilities in executing block operations [23].
A further advantage of the block-based approach is that it can be used to efficiently factorize real-world sparse matrices by leveraging the presence of supernodes. Fig. 1 illustrates how columns with the same nonzero pattern can be grouped together to form a supernode. The advantage of using supernodes is that they can be represented as dense blocks in the input matrix. Efficient implementations of dense BLAS and LAPACK routines can subsequently be used to accelerate block operations for supernodes [25]. Furthermore, the dependencies of supernodes upon each other can be determined during the analysis phase, which enables execution of independent block operations in parallel [23].
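To illustrate how columns are grouped into supernodes, the sketch below detects maximal runs of consecutive columns whose nonzero patterns chain together, given the factor's pattern in compressed sparse column (CSC) form. This is our own simplification: production analysis phases (such as CHOLMOD's) work from the elimination tree and also apply relaxed amalgamation, which this sketch omits.

```c
/* Simplified supernode detection. Column j joins the supernode of
 * column j-1 when the pattern of column j-1, with its first row index
 * removed, equals the pattern of column j. Patterns are the sorted row
 * indices of each column of the factor in CSC form (colptr of length
 * n+1, rowind holding row indices). Writes the first column of each
 * supernode to 'super' and returns the number of supernodes found. */
int find_supernodes(int n, const int *colptr, const int *rowind, int *super)
{
    int nsuper = 0;
    super[nsuper++] = 0;
    for (int j = 1; j < n; j++) {
        int p = colptr[j - 1], q = colptr[j];
        int lenp = colptr[j] - p - 1;    /* col j-1 minus its first entry */
        int lenq = colptr[j + 1] - q;    /* all of col j */
        int same = (lenp == lenq);
        for (int k = 0; same && k < lenq; k++)
            same = (rowind[p + 1 + k] == rowind[q + k]);
        if (!same)
            super[nsuper++] = j;         /* start a new supernode */
    }
    return nsuper;
}
```

Columns inside one such run share a common set of rows and can therefore be stored and processed as a single dense panel.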

B. RCF
The CPU implementation of CF available in CHOLMOD's release versions, i.e. RCF, exploits acceleration opportunities by using CPU-based BLAS and LAPACK routines.
It uses general matrix-matrix multiplication (GEMM) and symmetric rank-k update (SYRK) BLAS routines to perform the update block operation, while using the triangular solve (TRSM) BLAS routine to perform the solve operation. In addition, dense CF (POTRF) LAPACK routine is used to perform the factorize block operation [25].
A widely used GPU-based BLAS library, NVIDIA CUBLAS [35], provides more efficient implementations of BLAS routines than its CPU-based counterparts for processing large dense matrices. The hybrid RCF implementation in CHOLMOD transfers a large supernode to the GPU to accelerate its update and solve block operations using CUBLAS' versions of GEMM, SYRK and TRSM. Unfortunately, it actively processes only one supernode on the GPU at a time, which leads to underutilization of the GPU's compute units and increased latencies in launching CUBLAS routines while a supernode is transferred to the GPU [19]. Once the GPU finishes performing the update and solve operations, the results of these operations are written back to the host's memory. This division of workload between CPU and GPU for a single supernode increases communication overhead. Moreover, the only form of coarse-grained parallelization achieved is the overlapping of block operations' execution with CPU-GPU data transfers.

C. SCF
As mentioned previously in Section I, CHOLMOD beta v. 4.6.0 implements the hybrid SCF algorithm proposed by Rennich et al., which takes advantage of both fine-grained and coarse-grained parallelization opportunities. Like hybrid RCF, it uses CUBLAS to accelerate the execution of update and solve block operations for large supernodes on GPU. However, in contrast to hybrid RCF, it exploits the independence of supernodes belonging to different branches of the elimination tree to achieve coarse-grained parallelization. Under this strategy, once a supernode is selected for processing on GPU, the entire subtree rooted at the selected node is copied to the GPU's memory. All three block operations, including factorize, for all these supernodes are performed on GPU. The POTRF routine in a GPU implementation of LAPACK called CUSOLVER is used to perform the factorize block operation for large supernodes [36]. Calls to GPU routines for processing smaller supernodes are batched together to reduce kernel-launch latencies. Only once the entire subtree has been processed are its results copied back to the host's memory. In addition to overlapped execution between CPU and GPU, hybrid SCF also allows for concurrent processing of block operations on GPU using different CUDA streams.
In summary, hybrid SCF significantly reduces overheads in terms of both kernel-launch latencies and CPU-GPU communication, while achieving better utilization of the GPU's computational resources. Rennich et al. showed that this method provides significant speed-up over hybrid RCF [19]. When we examined the source code of CHOLMOD release v. 5.4.0 and beta v. 4.6.0, we found the default criteria for selecting a (root) supernode for processing on GPU to be: (1) the number of columns in the supernode is greater than 512, or (2) the number of rows in the supernode is greater than 256 and its number of columns is greater than 32. In the rest of this paper, we refer to a supernode as being GPU-friendly if it meets either of these criteria. The CF algorithms considered in this paper are summarized in Table 2.
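The GPU-friendliness criteria above translate directly into a small predicate; the function and macro names below are ours, not CHOLMOD's.

```c
/* Default criteria, as observed in the CHOLMOD release v. 5.4.0 and
 * beta v. 4.6.0 sources, for treating a supernode as GPU-friendly:
 * either it is very wide, or it is both tall and reasonably wide.
 * Threshold names are our own. */
#define GPU_MIN_COLS_ALONE 512
#define GPU_MIN_ROWS       256
#define GPU_MIN_COLS       32

int is_gpu_friendly(int nrows, int ncols)
{
    return ncols > GPU_MIN_COLS_ALONE ||
           (nrows > GPU_MIN_ROWS && ncols > GPU_MIN_COLS);
}
```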

IV. IMPLEMENTATION DETAILS
To make our results reproducible, we selected GLPK's PDIPM implementation as a baseline solver. We decided to adapt it by modifying its CF and SPMM routines. Modification of GLPK's CF implementation also necessitated changing the routine used for solving systems of normal equations. The decision to focus only on CF and SPMM was based on our observation that more than 99% of the overall solution time was contributed by these two operations, as shown in Table 3. The weightages shown in Table 3 were obtained by averaging execution times of different operations in the original GLPK implementation for the selected test problems specified later in Section V. It is obvious that even though SPMM is the most time-consuming operation, any enhancement in terms of CF can also result in significant performance gains.
In the rest of this section, we explain the procedure adopted for introducing CHOLMOD's SPMM implementation to GLPK, followed by a description of a simple modification that made the CPU-based and hybrid versions of RCF compatible with GLPK. Afterwards, we explain the modifications introduced to hybrid SCF that allowed us to use it in our proposed GLPK-based solver. To the best of our knowledge, GLPK has not previously been adapted in such a way. We have made the modified CHOLMOD and GLPK source code, along with detailed usage instructions, publicly available [37].

A. ADAPTING GLPK
We began by undertaking the relatively simple task of exploring efficient alternatives to GLPK's original SPMM routine used to compute the product S = ADA^T, where, as mentioned previously, D represents a diagonal matrix. If the matrix D is temporarily ignored for ease of illustration, then the following linear combination represents how GLPK's SPMM routine computes each nonzero element (s_ij) of the matrix S.

s_ij = Σ_{k : a_jk ≠ 0} a_ik a_jk

It is evident that all the terms having their second operand (a_jk) equal to zero are eliminated, thus reducing the number of floating-point multiplications. We subsequently learnt that CHOLMOD also provides a CPU implementation of SPMM, proposed by Davis [28], which performs the same product more efficiently as follows.

S = Σ_{k} a_{:,k} a_{:,k}^T, where each outer product runs only over the nonzero entries of the column a_{:,k}

In contrast to GLPK's SPMM routine, this implementation eliminates all the terms in which either of the two operands is equal to zero, which further reduces the number of floating-point operations. It is shown later in Section VI-A that the replacement of GLPK's original SPMM routine with its counterpart from CHOLMOD led to significant overall performance gains for our proposed solver. GLPK uses the compressed sparse row (CSR) format to store the matrix A, which is the exact transpose of the compressed sparse column (CSC) format used by CHOLMOD. This relationship greatly simplified the conversion of matrices between the storage formats used by GLPK and CHOLMOD.
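The outer-product formulation can be sketched as follows. This is our own illustration of the idea: A is held in CSC form, so both operands of every multiplication are known nonzeros; for clarity the result S is accumulated densely, whereas CHOLMOD's routine builds S directly in sparse form.

```c
/* Sketch of S = A * D * A^T that skips every term in which either
 * operand is zero. A is m-by-n in CSC form (colptr/rowind/val), d holds
 * the diagonal of D, and S is an m-by-m row-major dense accumulator. */
void spmm_adat(int m, int n, const int *colptr, const int *rowind,
               const double *val, const double *d, double *S)
{
    for (int i = 0; i < m * m; i++)
        S[i] = 0.0;
    for (int k = 0; k < n; k++)                       /* column k of A */
        for (int p = colptr[k]; p < colptr[k + 1]; p++)
            for (int q = colptr[k]; q < colptr[k + 1]; q++)
                /* both val[p] and val[q] are nonzero by construction */
                S[rowind[p] * m + rowind[q]] += val[p] * d[k] * val[q];
}
```

Because GLPK stores A in CSR, which is exactly the CSC representation of A^T, the same arrays can be handed to a CSC-based routine without any data movement.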
We also considered the option of performing SPMM using popular GPU-based sparse linear algebra libraries like CUSP and CUSPARSE [38], [39]. In addition to these libraries, we came across other works related to GPU implementation of SPMM as well [40]- [42]. However, we preferred CHOLMOD over all these GPU implementations in this work because both hybrid RCF and SCF process only large supernodes on GPU and require the matrix S to be stored in the host's memory.
It was explained in Section III-B and Section III-C how the different versions of supernodal CF available in CHOLMOD use CPU and GPU implementations of BLAS and LAPACK to accelerate the execution of block operations related to large supernodes. Initially, we introduced these CF methods, without any modification, to GLPK's PDIPM implementation by replacing calls to its original CF-related routines with calls to appropriate routines from different versions of CHOLMOD. Unfortunately, the CF implementations of GLPK and CHOLMOD handle the situation differently when the matrix S stops being positive-definite because of the accumulation of round-off errors over multiple iterations, which is indicated by the occurrence of non-positive diagonal elements (s_ii). GLPK handles this situation by replacing Line 7 of Algorithm 1 with the following conditional statement, in accordance with an approximation method proposed by Wright [43].

l_ii ← DBL_MAX if s_ii ≤ 0; l_ii ← √s_ii otherwise

where DBL_MAX represents a large real number defined in the C header file <float.h>, which approximates infinity. In contrast to GLPK's use of Wright's approximation method, the CF implementations available in different versions of CHOLMOD consider the occurrence of a non-positive diagonal element to be an error condition and terminate the factorization process. In the next two subsections, we describe how we adapted RCF and SCF to make them compatible with GLPK's PDIPM implementation.

B. ADAPTING CPU-BASED AND HYBRID RCF
As mentioned previously, both the CPU-based and hybrid versions of RCF available in CHOLMOD release v. 5.4.0 perform the factorize block operation using the POTRF routine of a CPU-based LAPACK implementation available on the system. On detecting a non-positive diagonal element, the default behavior of POTRF routines is to terminate after returning an error. To modify this behavior, we decided to use a LAPACK version that comes bundled with an open-source BLAS implementation called OpenBLAS [44]. We found that OpenBLAS was already referenced as a possible BLAS candidate in the main configuration file of CHOLMOD's parent package called SuiteSparse [26]. We initially introduced the same conditional statement given in the previous subsection to POTRF, which made both versions of RCF compatible with GLPK's PDIPM implementation. However, we observed that even after this modification, the solutions of some test problems suffered from numerical instability. Further experiments revealed that relaxing the original condition as follows resolved these difficulties.

l_ii ← DBL_MAX if s_ii < ε; l_ii ← √s_ii otherwise

where ε represents a very small positive real number. We experimented with different values of ε and found ε = 1 × 10⁻¹⁴ to be a suitable value that ensured convergence to the solution in a minimal number of iterations while eliminating numerical instability. We also replaced the original conditional statement of GLPK with the new statement.

C. ADAPTING HYBRID SCF
As opposed to RCF, the modification of the hybrid SCF available in CHOLMOD beta v. 4.6.0 was not as straightforward, since it uses two different GPU routines for performing the factorize block operation:
1. Hybrid SCF calls CUSOLVER's double-precision version of POTRF, named cusolverDnDpotrf(), to factorize large supernodes on GPU [36]. Hybrid SCF does not check for the presence of non-positive diagonal elements before calling this function. Moreover, it does not check the status returned by cusolverDnDpotrf() for the occurrence of errors. We could not modify cusolverDnDpotrf() since CUSOLVER is a proprietary library. Consequently, we developed a new CUDA kernel that determined whether a non-positive diagonal element existed in the supernode. Subsequently, cusolverDnDpotrf() was called only if no non-positive diagonal element was found. Otherwise, the supernode was copied back to the host to be factorized using the same POTRF routine that was used by RCF. The status returned by cusolverDnDpotrf() was also checked to determine if it successfully performed the factorize operation. In case of an error, a flag was set to indicate to the calling PDIPM routine that it should switch to using CPU-based RCF for its remaining iterations. However, during our experiments, we found this error condition to be rare; it occurred only once, during the final iterations of one test problem. Fig. 2 summarizes these adaptations.
2. Hybrid SCF uses a custom CUDA kernel that performs the factorize block operation on relatively small supernodes as a batch. We simply introduced the modified conditional statement, shown in the previous subsection, to this CUDA kernel for handling non-positive diagonal elements.
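The diagonal check that gates the call to cusolverDnDpotrf() can be sketched in plain C; the real check runs as a CUDA kernel on the device, and the function name and layout assumptions below are our own.

```c
/* Host-side sketch of the pre-factorization check: scan the diagonal of
 * a supernode's dense panel (column-major, leading dimension lda) for
 * entries below the threshold. Returns 1 if the fast GPU POTRF path is
 * safe, 0 if the supernode must fall back to the modified CPU POTRF. */
int diagonal_is_safe(int ncols, int lda, const double *panel, double eps)
{
    for (int j = 0; j < ncols; j++)
        if (panel[j * lda + j] < eps)
            return 0;
    return 1;
}
```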
We also fixed some bugs related to memory management and creation of handles to NVIDIA libraries CUBLAS and CUSOLVER in hybrid SCF implementation, which originally caused PDIPM to terminate abnormally for some test problems. Relevant debugging details are available in the publicly available source code package [37].

D. GLPK-BASED SOLVERS
To compare the effects of using different CF implementations on the performance of PDIPM, we further modified GLPK's PDIPM implementation that used CHOLMOD's SPMM routine to obtain three new versions by using the three CF adaptations discussed in the previous two subsections. The first version uses our adaptation of CPU-based RCF. In the rest of this paper, we refer to this solver as CPU-RCF. The second version uses our adaptation of hybrid RCF. In the rest of this paper, we refer to this solver as HYB-RCF. The third version, our proposed solver, uses our adaptation of hybrid SCF. In the rest of this paper, we refer to our proposed solver as HYB-SCF. We compared the performance of all these solvers against the original PDIPM implementation of GLPK, which is referred to as CPU-BASE in the rest of this paper. These solvers are summarized in Table 4.

E. CPLEX-BASED SOLVER
In addition to the solvers discussed in the previous subsection, we also used a simple C program to invoke CPLEX's PDIPM (barrier method) routine with its default settings. In the rest of this paper, we refer to this program as CPU-CPLEX. Furthermore, we refer to the state-of-the-art hybrid solver, proposed by Maggioni, as HYB-MAGGIONI.
Maggioni reported results comparing the performance of his solver against CPLEX. For an approximate performance comparison between HYB-SCF and Maggioni's solver, we used: (1) the results of our experiments, as discussed later in Section VI, comparing the performance of HYB-SCF against CPU-CPLEX, and (2) the speed-ups reported by Maggioni for his solver against CPLEX. It is important to mention that Maggioni relaxed the tolerances (relative primal infeasibility, relative dual infeasibility and primal-dual gap) used to determine the convergence of the LP problem's solution. However, in the interest of a fairer performance comparison, we did not modify the default tolerances in either the GLPK-based solvers or CPU-CPLEX.

V. EXPERIMENTAL SETUP
To test the relative performance of different solvers, we selected a set of eight well-known sparse problems, previously used by Maggioni to test the performance of the state-of-the-art hybrid solver [22]. This selection enabled us to fairly compare the performance of HYB-SCF against its most recent competitor (HYB-MAGGIONI). Furthermore, these problems vary in size and density (percentage of nonzero values), which provides an opportunity to study the effects of variations in these parameters on the performance of different solvers. Details of these problems are given in Table 5. With the possible exception of RAIL507, all the problems can be categorized as large LP problems. RAIL507 is also the least sparse problem, having a density of around 1.3%. On the other hand, WATSON2, which is hyper-sparse, is the largest problem in the set. Table 6 compares the hardware we used during our experiments against the hardware used by Maggioni to test the performance of his solver. These hardware specifications indicate that the speed-up reported later in Section VI-B, for our proposed solver (HYB-SCF) over the state-of-the-art solver (HYB-MAGGIONI), is not because of any peculiar hardware choices that could have unduly favored GPU over CPU or vice versa. Nevertheless, this speed-up remains the outcome of an estimated performance comparison.
In terms of software, we modified PDIPM implementation of GLPK v. 4.65 as discussed in the previous section. We used compute unified device architecture (CUDA) toolkit v. 9.0 in Ubuntu 17.04 operating system to execute GPU-end routines. We modified LAPACK implementation available in OpenBLAS v. 0.3.6, as discussed in the previous section, to perform factorize block operation using CPU [44]. We used IBM ILOG CPLEX Optimization Studio v. 12.6 to implement CPU-CPLEX.
We solved the eight test problems five times using all the GLPK-based solvers, except for CPU-BASE, which took unreasonably long (more than four hours) to solve some problems. For each run of these solvers, we measured overall solution time, average per-iteration solution time, average CF time and average SPMM time to compute S = ADAᵀ. We also solved the test problems five times using CPU-CPLEX.
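As a language-agnostic illustration (the solvers themselves are written in C and use CHOLMOD's SPMM routine, not the code below), the per-iteration computation of S = ADAᵀ can be sketched with SciPy's sparse types. The function name and the tiny example matrix are ours, introduced only for this sketch:

```python
import numpy as np
from scipy import sparse

def normal_equations(A, d):
    """Form S = A * diag(d) * A^T for a sparse constraint matrix A.

    A : scipy.sparse matrix of shape (m, n)
    d : positive diagonal scaling vector of length n; in PDIPM this
        vector changes every iteration, so S must be recomputed.
    """
    D = sparse.diags(d)           # n x n diagonal scaling matrix
    return (A @ D @ A.T).tocsc()  # CSC ordering suits Cholesky codes

# Tiny hypothetical 2x3 constraint matrix, for illustration only.
A = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 3.0, 1.0]]))
d = np.array([1.0, 2.0, 0.5])
S = normal_equations(A, d)
print(S.toarray())  # S is symmetric positive semidefinite
```

Because only d changes between iterations while the sparsity pattern of S is fixed, production codes such as CHOLMOD perform the symbolic analysis once and repeat only the numeric part.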

VI. RESULTS
In this section, we first compare the performance of the four GLPK-based solvers: CPU-BASE, CPU-RCF, HYB-RCF and HYB-SCF. Afterwards, we compare HYB-SCF's performance against CPU-CPLEX as well as Maggioni's state-of-the-art hybrid solver.

A. COMPARISON OF GLPK-BASED SOLVERS
Due to a high degree of variation in timing measurements for different GLPK-based solvers, we opted to tabulate these measurements in Table 7 and Table 8. Table 7 shows the number of iterations and total solution time, whereas Table 8 shows per-iteration time, average CF time and average SPMM time. Speed-ups achieved by the three modified versions of GLPK (CPU-RCF, HYB-RCF and HYB-SCF) over the baseline version (CPU-BASE) in terms of total solution time, average CF time and average SPMM time are shown in Fig. 3a, Fig. 3b and Fig. 3c, respectively.
As mentioned previously, we solved each problem five times using CPU-RCF, HYB-RCF and HYB-SCF. Each run of a solver for a specific problem executed the same number of iterations, except for HYB-SCF's attempts to solve RAIL2586 and RAIL4284. We observed that HYB-SCF executed 109 to 112 iterations for RAIL2586 and 92 or 93 iterations for RAIL4284. These variations explain the presence of fractional parts in the number-of-iterations column of Table 7 corresponding to HYB-SCF for these two problems. The variations can be attributed to the relatively large number of iterations required to solve these problems, which increases the chances of differences in the intermediate results of the same set of floating-point operations while performing POTRF on GPU, caused by the absence of any predefined order in the execution of GPU threads. Fig. 3a and Table 7 show that HYB-SCF outperformed CPU-BASE for all the problems. Fig. 3a also shows that HYB-SCF's speed-up generally improved with an increase in the size of the problems' constraint matrices (m × n). The lack of significant speed-up for the largest problem, WATSON2, is because of its hyper-sparse nature. As shown in Table 9, the matrix S in this problem does not contain supernodes, which eliminates any possibility of gaining speed-up by using supernodal CF.
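The order sensitivity underlying these iteration-count variations can be illustrated independently of any GPU code: floating-point addition is not associative, so reducing the same data in two different orders generally produces results that agree only to within rounding error. On a GPU, where thread execution order is not predefined, such rounding differences can accumulate across PDIPM iterations. A minimal Python sketch (ours, not the solver's code):

```python
import numpy as np

# Same float32 data, summed in two different orders.
rng = np.random.default_rng(0)
x = rng.standard_normal(50_000).astype(np.float32)

s_forward = np.float32(0.0)
for v in x:               # fixed left-to-right order
    s_forward = np.float32(s_forward + v)

s_reversed = np.float32(0.0)
for v in x[::-1]:         # the same values, reversed order
    s_reversed = np.float32(s_reversed + v)

# The two results typically differ in their last bits; over a hundred
# or so PDIPM iterations such tiny discrepancies can grow enough to
# change the total iteration count.
print(s_forward, s_reversed)
```

Both results are correct to within float32 rounding; neither ordering is "the" right answer, which is why the observed iteration counts for RAIL2586 and RAIL4284 vary without indicating a bug.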
In terms of HYB-SCF's performance gains over the solvers that use RCF, i.e. CPU-RCF and HYB-RCF, it clearly outperformed them for five of the eight test problems. Among the remaining three problems, matrix S in RAIL507 was obviously not large enough to contain any substantial supernodes suitable for processing on GPU. Similarly, the hyper-sparse nature of WATSON2 meant that its matrix S also did not have any supernodes for supernodal factorization to achieve any speed-up, whether on CPU or on GPU. Lastly, RAIL2586 is interesting in the sense that it is the second-least sparse problem among all the test problems. As shown in Table 9, the matrix S in this problem has only four supernodes large enough to be suitable for processing on GPU. Furthermore, the cumulative size of these four supernodes is the smallest among all the problems having supernodes, except for the very small matrix S in problem RAIL507. This relatively small number and size of GPU-friendly supernodes in RAIL2586 inhibited the ability of both the hybrid solvers, HYB-RCF and HYB-SCF, to gain performance over the CPU-based solver (CPU-RCF). However, it is encouraging to note that even for the hyper-sparse problem, WATSON2, the performance of HYB-SCF did not significantly drop below that of the other solvers.
HYB-SCF achieved overall average speed-ups of 55.458X, 1.806X and 1.747X over CPU-BASE, CPU-RCF and HYB-RCF, respectively. It achieved maximum speed-ups of 151.301X, 2.508X and 2.373X over CPU-BASE, CPU-RCF and HYB-RCF, respectively. It is interesting to note that, as opposed to HYB-SCF, HYB-RCF managed to achieve discernible speed-ups over CPU-RCF for only three of the test problems: KARTED, DEGME and TP6. This (at least partially) explains the preference given to CGM over CF during previous attempts to use GPU technology for accelerating PDIPM [21], [22]. Fig. 3b shows the speed-up curves of CF in the four GLPK-based solvers, which follow trends reasonably similar to those shown in Fig. 3a for overall speed-up. It demonstrates that hybrid SCF had a significant impact on enhancing the overall performance of its corresponding solver, HYB-SCF. We calculated the average contribution of CF towards the overall reduction in solution time achieved by HYB-SCF against CPU-BASE to be around 38.12% [37]. Fig. 3c shows the speed-up achieved by CHOLMOD's SPMM routine used in the three adapted solvers (CPU-RCF, HYB-RCF and HYB-SCF) against the corresponding routine used in CPU-BASE. It is clear that except for a relatively small speed-up for WATSON2, which is a hyper-sparse problem, CHOLMOD's SPMM routine provided significant speed-ups over the corresponding routine of GLPK for all other test problems. We calculated the average contribution of SPMM towards the overall reduction in solution time achieved by HYB-SCF against CPU-BASE to be around 61.88% [37]. We subsequently discovered that if only GLPK's SPMM routine was replaced with the corresponding routine of CHOLMOD, without replacing GLPK's original CF routine, then the average speed-up achieved by the proposed solver over CPU-BASE would drop from 55.458X to 2.711X.

B. HYB-SCF VS. CPU-CPLEX AND HYB-MAGGIONI
Performance comparison of HYB-SCF against CPU-CPLEX and HYB-MAGGIONI is summarized in Table 10 and Fig. 4. Table 10 shows that HYB-SCF achieved a reduction of 5.68s over CPU-CPLEX in terms of average solution time, which translates into an average speed-up of 1.053X. However, the speed-up curve of HYB-SCF, as shown in Fig. 4a, lacks any meaningful trend with an increase in problem size. The irregular nature of this curve can be attributed to the following two factors. First, the CF algorithm used in HYB-SCF leverages the presence of GPU-friendly (large) supernodes to gain performance. Consequently, the performance gained by HYB-SCF over CPU-CPLEX is positively related to the percentage of GPU-friendly supernodes in the matrix S. Second, CPLEX is a highly optimized CPU-based solver, which contains techniques to efficiently process dense columns, in addition to other features that enhance its performance [29]. On the other hand, HYB-SCF is based on PDIPM as implemented in GLPK, which does not have any method for efficient processing of dense columns [15]. Consequently, the speed-up achieved by HYB-SCF over CPU-CPLEX is negatively related to the percentage of dense columns in the matrix S. To assign appropriate weights to these factors while comparing the performance of HYB-SCF against CPU-CPLEX, we devised the following ratio to sort the test problems.

τ = (Percentage of GPU-friendly supernodes in the matrix S) / (Percentage of dense columns in the matrix S)

Fig. 4b shows that the speed-up achieved by HYB-SCF over CPU-CPLEX generally follows an upward trend as the ratio τ increases. We have also used the same ratio to sort problems in Table 10. These results signify the true impact of our proposed approach, which made the performance of a primitive implementation of PDIPM competitive with CPLEX's highly optimized industry-standard CPU implementation. It must be emphasized that the speed-up achieved by our proposed solver over CPU-CPLEX depends on performance gains achieved by the adapted CF implementation as well as CHOLMOD's highly efficient SPMM routine. We also compared the solutions returned by HYB-SCF and CPU-CPLEX, which closely matched each other. In concrete terms, there was an average percentage error of 5.80 × 10⁻⁷ % and a maximum percentage error of 2.74 × 10⁻⁶ % in the solutions returned by HYB-SCF as compared to CPU-CPLEX.
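To make the use of the ratio τ concrete, the following Python sketch ranks problems by τ. All percentage values shown are hypothetical placeholders for illustration only; the measured values appear in our tables and are not reproduced here:

```python
# Hypothetical per-problem percentages (illustrative only, NOT the
# measured values from Table 9 or Table 10).
problems = {
    "WATSON2": {"gpu_supernodes_pct": 0.0,  "dense_cols_pct": 0.5},
    "RAIL507": {"gpu_supernodes_pct": 2.0,  "dense_cols_pct": 4.0},
    "KARTED":  {"gpu_supernodes_pct": 30.0, "dense_cols_pct": 1.0},
}

def tau(stats):
    # tau = percentage of GPU-friendly supernodes in S
    #       divided by percentage of dense columns in S
    return stats["gpu_supernodes_pct"] / stats["dense_cols_pct"]

# Sort problems from least to most favorable for HYB-SCF.
ranked = sorted(problems, key=lambda name: tau(problems[name]))
print(ranked)
```

A larger τ means more GPU-friendly supernodal work relative to dense columns that GLPK's PDIPM cannot process efficiently, which is why the speed-up over CPU-CPLEX tends to grow with τ.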
Finally, we compare the performance of HYB-SCF against HYB-MAGGIONI. Table 10 shows the approximate speed-up achieved by HYB-MAGGIONI over CPU-CPLEX for each test problem, as reported by Maggioni [45]. It is evident from these speed-up measurements that HYB-MAGGIONI outperformed CPU-CPLEX for only three test problems: RAIL507, RAIL2586 and KARTED. The maximum speed-up achieved was around 1.29X. Subsequently, we used these speed-up measurements in conjunction with our measurements of CPU-CPLEX's solution times to estimate HYB-MAGGIONI's solution times, as shown in Table 10 and plotted in Fig. 4c. As discussed earlier in Section IV-E and Section V, this comparison is not biased towards HYB-SCF.
The speed-up curve of HYB-SCF in Fig. 4c follows trends similar to those shown in Fig. 4b. The reasons for this similarity between the speed-up curves of HYB-SCF over CPU-CPLEX and HYB-MAGGIONI are twofold. First, as discussed previously in Section I, Maggioni used CHOLMOD's hybrid RCF routine only during some iterations of PDIPM. Furthermore, CHOLMOD's hybrid RCF routine provides meaningful speed-up over its CPU-based counterpart for only three of the test problems. Consequently, as opposed to HYB-SCF, the presence of GPU-friendly supernodes did not result in significant performance gains for HYB-MAGGIONI. Second, HYB-MAGGIONI used a sparse representation of matrices that was apparently able to handle the presence of dense columns efficiently [22], [45]. On the other hand, HYB-SCF did not have any such mechanism. The net effect is that the speed-up of HYB-SCF generally increases with an increase in the ratio τ. If we aggregate the solution times of HYB-SCF and HYB-MAGGIONI over all the test problems, we get an estimated average speed-up of 1.136X. Interestingly, HYB-SCF also outperformed HYB-MAGGIONI for the hyper-sparse problem, WATSON2, which contains neither dense columns nor any supernodes. This observation is significant, because it implies that HYB-SCF outperformed Maggioni's state-of-the-art hybrid solver for all the test problems, except for the three problems having a large number of dense columns.

VII. CONCLUSION AND FUTURE WORK
This work began with the premise that a modern GPU is potentially a high-performance computing resource for developing efficient LP problem solvers. We were particularly interested in PDIPM. However, the affinity of GPUs for dense linear algebra operations and the suitability of PDIPM for solving sparse LP problems apparently lead to a contradiction. Nevertheless, the presence of supernodes, which can be treated as dense submatrices, in sparse matrices provides opportunities for GPU acceleration of PDIPM.
In this paper, we proposed a new LP problem solver based on the PDIPM implementation of GLPK. The new solver significantly accelerated the process of solving real-world sparse problems. Like the state-of-the-art solver, our proposed solver also used a CF method that exploited the presence of supernodes in sparse matrices. However, as opposed to the state-of-the-art solver, our proposed solver used an adaptation of a more efficient CF method that performs batched processing of related supernodes on GPU, which has not been used previously for accelerating PDIPM. Our adaptation made the new CF method more resilient to the effects of round-off errors that accumulate in the elements of the matrix S over multiple iterations of PDIPM. In addition to the new CF method, we used a CPU-based SPMM routine that required fewer floating-point operations as compared to GLPK's original routine. For a set of well-known sparse problems, our proposed solver was able to achieve average speed-ups of around 55X and 1.05X over original PDIPM routines of GLPK and CPLEX, respectively. Moreover, we estimated that it achieved an average speed-up of around 1.14X over the state-of-the-art hybrid solver.
Finally, we provide a couple of future research possibilities related to this work. First, the effects of using GPU-based SPMM methods to compute the matrix S, mentioned in Section IV-A, need further investigation. Second, during the course of this work, we came across an enhanced version of the hybrid SCF algorithm, proposed by Tang et al. [46], which uses multiple CPU threads and pipelining to achieve further performance gains. The potential of this CF method for accelerating PDIPM may also be studied in a future work.