Strength Check of Aircraft Parts Based on Multi-GPU Clusters for Fast Calculation of Sparse Linear Equations

To improve its cost-effectiveness, the next-generation vehicle must be reusable while adopting a lighter structural weight, so the strength calculation and condition monitoring of key components must be realized in a digital twin. Most current monitoring methods are based on the characteristics of various data acquisition systems, but they require the support of a large amount of flight data. These disadvantages can be avoided by reducing the structure of aircraft components to a finite element model and quickly checking the key components in the health management system. To solve the problem of fast calculation of the finite element model of the key components of the aircraft, a parallel algorithm and framework for the large-scale sparse-matrix preprocessed conjugate gradient method based on CUDA (Compute Unified Device Architecture) technology is proposed for a multi-GPU (Graphics Processing Unit) workstation cluster environment. When the sparse matrix is too large to be processed on a single workstation, this paper discusses how to realize optimized data partitioning in the distributed multi-GPU computing environment. For the preprocessed iterative solution, two preprocessing strategies, parallel matrix bandwidth reduction and incomplete Cholesky decomposition, are proposed, and asynchronous task concurrency and load-balancing strategies are designed on this architecture. Calculations on examples from a standard sparse matrix collection show that the proposed algorithm and architecture can solve large-scale sparse systems quickly and efficiently and can complete the fast strength verification of vehicle components.


I. INTRODUCTION
With the gradual reduction of the structural weight ratio of the aircraft and the transformation from a slender body to composite and lifting bodies, vibration during flight has an increasingly significant impact on the aircraft's dynamic characteristics [1]. Because of the reduced stiffness of the aircraft, real-time calculation of the downlinked structural strength data is of great significance for ensuring flight safety under time-varying conditions [2]. The digital twin system is the basis for system-level fault tolerance of the aircraft. For the fault detection and instruction reconstruction that may occur during flight, the on-board computer cannot carry out complex calculations due to limited hardware resources. The common approach is to send the sensor data back to the ground and make complex, accurate judgments on ground workstations with stronger computing power. At present, spacecraft health management systems are mainly aimed at two modules, thermal protection and the engine, and have achieved very good results. In thermal protection, the bending and buckling of the Integrated Thermal Protection System (ITPS) are usually considered [3]. For possible engine failures in operation, the fatigue damage caused by temperature gradients is mainly considered. Both kinds of stress-strain problems are usually calculated with the finite element method [4]-[6]. It can be seen that finite element modeling of key components and the rapid solution of the corresponding statics equations are the focus of the aircraft health management system. To guarantee the real-time performance of the computation, parallel computing is bound to be adopted.
With the development of computer technology, the number of transistors in GPUs (graphics processing units) is increasing rapidly, realizing computing power far beyond that of multi-core CPUs. While the multi-core CPU has hit the temperature wall and the power wall, GPU computing power develops rapidly with the continuous progress of semiconductor technology. At the same time, the concept of GPGPU (general-purpose computing on graphics processing units) and the CUDA platform released by NVIDIA make it more convenient to use GPUs for high-performance computing, which has become an important branch of the field. Programmers can take advantage of the C/C++ interface provided by NVIDIA for parallel programming and achieve several times the performance of the CPU in some applications [7], [8].
Many scholars have done related research on the parallel calculation of element stiffness matrices [8]-[10]. For the solution of sparse linear equations, GPU parallel algorithms based on finite element or finite volume methods have also been applied in acoustics [11], electromagnetics [12], and fluid mechanics [13]. Basermann et al. [14] elaborated the implementation strategy of the sparse-matrix preprocessed conjugate gradient method on large-scale parallel machines in 1995 and discussed data storage and partitioning in detail. In 2003, Bolz et al. [15] realized a conjugate gradient solver on the GPU, i.e. the iterative solution of large-scale linear equations was realized using graphics APIs before general-purpose GPU computing existed. After CUDA appeared, Bell and Garland [16] completed the algorithm design of sparse matrix-vector multiplication (SpMV), the most important kernel in iterative methods. Zaharovits et al. [17] designed a shared-memory parallel conjugate gradient solver for topology optimization with the finite element method. Amorim et al. [18] proposed assembling the element stiffness matrix with the Meshless Local Petrov-Galerkin (MLPG) method in 2019 and compared the speed of the BiCG method with related methods of the conjugate gradient family. Previous research experience shows that if the dimension of the sparse system is small, using a GPU has no obvious advantage over the CPU: due to data transmission and instruction dispatch between the CPU and the GPU, running a direct method on the CPU is even faster than implementing an iterative method on the GPU in this scenario. When the matrix dimension is large, the time for a direct solution grows super-linearly and requires too much CPU memory, and the advantages of the GPU become obvious.
In this situation, the iterative solver becomes the only choice. However, it is worth noting that to let the many threads of a GPU work in parallel, the data must reside in GPU memory. To ensure the validity of the results, the strength-check calculations in this paper use double-precision floating-point arithmetic. Generally, Tesla-series GPUs are used as computing cards, with graphics memory ranging from 8 GB (P4) to 32 GB (V100). Compared with workstation motherboards that can hold hundreds of GB, this amount of graphics memory cannot accommodate all the data involved in large-scale computing. Therefore, to avoid memory overflow, the matrix is usually decomposed into several submatrices that are transferred to the GPU for calculation in batches. Most previous studies on accelerated FEM or sparse linear solvers have focused on algorithms designed for a single GPU, without considering large-scale calculations in which the matrix dimension exceeds the graphics memory. The computing environment of this paper is a cluster of two workstations, each with a multi-core CPU and dual GPUs, which is regarded as a hybrid Symmetric Multi-Processing (SMP)/distributed-memory system. Although the distributed characteristics of GPU computing are similar to cluster computing, there are significant differences in the computing environment, so different communication, distribution, and synchronization strategies must be designed for the two architectures.
This paper discusses GPU-accelerated sparse linear algebra for the finite element model of an aircraft. Based on the two most important libraries released by NVIDIA, cuBLAS [19] and cuSPARSE [20], it focuses on the most time-consuming iterative solution steps for the finite element matrix, including the overall analysis of bandwidth optimization and the iterative solution of large sparse matrices. Since the conjugate gradient method has been studied in depth in the literature, the GPU-based bandwidth optimization and data partitioning are described in detail here. Bandwidth optimization of the sparse matrix not only reduces the number of conjugate gradient iterations and thus the computation, but also shrinks the blocks of the sparse matrix, minimizing the communication between computing nodes.

II. DESIGN OF PARALLEL EIGENVALUE SOLVING ALGORITHM FOR LARGE SPARSE MATRIX
Both the analysis of the aero-elastic coupling of the aircraft and the calculation of yield strength after component damage are essentially problems of solving large sparse linear systems Ax = b, with A ∈ R^{n×n}, b ∈ R^n, and x the displacement vector. Since the direct method is difficult to apply at this scale, the system is replaced by the approximate equation Mx̃ = b, where M is an approximation of A. The error between this approximate solution x̃ and the solution of the original equation is characterized by the residual r = b − Ax̃. Solving MΔx = r yields an approximate correction Δx, so the corrected approximate solution is x̃ ← x̃ + Δx. These steps are repeated while the accuracy requirement is not met, which gives the general formula of the iterative method: x_{k+1} = x_k + M^{−1}(b − Ax_k). If M = I, the famous Richardson iteration is obtained after the standard splitting: x_{k+1} = x_k + r_k, where r_k = b − Ax_k is the residual of the previous step. As mentioned in the previous chapter, in large-scale numerical simulation the solution of the sparse linear system occupies a considerable part of the calculation time and resources, so higher requirements are imposed on its solution speed.
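As a concrete illustration of the iteration above, a minimal Richardson solver can be sketched in Python (NumPy stands in for the GPU kernels; the damping factor omega is an illustrative choice, not part of the original formulation):

```python
import numpy as np

def richardson(A, b, omega=0.1, tol=1e-10, max_iter=10_000):
    """Richardson iteration x_{k+1} = x_k + omega * (b - A x_k).

    With M = I the general preconditioned update reduces to this form;
    omega is a damping factor (an assumption here) that must satisfy
    the spectral radius condition for convergence."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        r = b - A @ x          # residual of the previous step
        if np.linalg.norm(r) < tol:
            break
        x = x + omega * r      # correction step
    return x
```

For a symmetric positive definite A with suitably small omega, the iterates converge to the solution of Ax = b.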

A. DESIGN OF PARALLEL BANDWIDTH OPTIMIZATION ALGORITHM FOR LARGE SPARSE MATRIX
Section II.A is divided into two parts. The first part introduces the RCM algorithm and the calculation process of the pseudo-peripheral vertex. The second part presents the design of the CUDA parallel algorithm based on linear algebra.

1) DEFINITIONS AND NOTATIONS
In the stress and strain analysis for the refined modeling and flight simulation of large spacecraft, the numbers of elements and nodes change constantly due to fuel consumption and the separation of components, so the equations need to be renumbered and reassembled for each calculation [21]. As far as the strength check is concerned, the linear algebra calculation must be completed every few hundred milliseconds, so the timeliness requirement is very strict. Bandwidth reduction can significantly reduce the amount of calculation, enhance the stability of the calculation, and speed up the iterative method [22], so related research on parallel computing of the bandwidth optimization problem is carried out here. For the finite element model of the aircraft, the stiffness matrix is symmetric, and bandwidth reduction amounts to finding a permutation matrix P that makes the bandwidth of PAP^T smaller. At present, the essential algorithms include Cuthill-McKee (CM) [23], Reverse Cuthill-McKee (RCM) [24], and Approximate Minimum Degree (AMD) [25], [26]. As a variant of the CM algorithm, RCM is considered one of the most promising low-cost heuristics for reducing bandwidth. The RCM algorithm transforms the matrix into an adjacency structure and an undirected graph in the sense of graph theory: a pseudo-peripheral vertex is used as the root node, the vertices in the graph are labeled in increasing order, and vertices at the same distance from the root are grouped into one level, so the structural nodes form a tree of levels, as shown in Algorithm 1.
Previously, Oliveira and Abreu [27] compared 7 search methods for pseudo-peripheral vertices and verified by examples that the George and Liu [28] algorithm is more suitable for symmetric sparse matrices. Nascimento Rodrigues et al. [29] realized the RCM reordering of sparse matrices using OpenMP and obtained up to 50% acceleration on small matrices at the millisecond time scale. Azad et al. [24] and Nascimento Rodrigues et al. [29] have done related work on distributed computing of RCM algorithms in the MPI-OpenMP environment. This section draws on these results and designs an RCM algorithm applicable to heterogeneous computing platforms in combination with CUDA. Let the dimension of the symmetric matrix A be n, and let f_i(A) denote the column index of the first non-zero element in row i: f_i(A) = min{ j | a_ij ≠ 0 }. Due to the symmetry of matrix A, the bandwidth of row i is β_i(A) = i − f_i(A), so the matrix bandwidth can be expressed as β(A) = max{ β_i(A) | 1 ≤ i ≤ n }. The set of all non-zero elements is NZ(A) = { (i, j) | a_ij ≠ 0 }. The undirected graph of A is expressed as G = (V, E), where V is the set of vertices and E is the set of edges; denote the number of vertices by n = |V| and the number of edges by m = |E|. Let d(v, w) be the distance between any two vertices v and w; the farthest distance from v to any vertex in the graph is l(v) = max{ d(v, w) | w ∈ V }. The rooted level structure of a vertex v ∈ V is the partition of V given by L(v) = { L_0(v), L_1(v), ..., L_{l(v)}(v) }, where L_i(v) contains the vertices at distance i from v. RCM repeatedly labels all unvisited neighbours of the current vertex, as shown in Algorithm 1. It is essentially a breadth-first search (BFS) that makes the tree deep enough to reduce the number of nodes in each layer, thereby reducing the bandwidth of the matrix [30], [31]. It is based on the CM method, sorts the nodes in reverse order, and takes the smaller of the sums of column heights in the forward and reverse orders.
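The bandwidth definitions above can be checked with a short Python sketch (dense NumPy arrays are used here for clarity; the real code works on sparse storage):

```python
import numpy as np

def bandwidth(A):
    """Bandwidth of a symmetric matrix following the definitions above:
    f_i(A) is the column of the first non-zero in row i, the row
    bandwidth is i - f_i(A), and beta(A) is the maximum over all rows."""
    n = A.shape[0]
    beta = 0
    for i in range(n):
        nz = np.flatnonzero(A[i])
        if nz.size:
            beta = max(beta, i - nz[0])
    return beta
```

A tridiagonal matrix has bandwidth 1, while a single non-zero far from the diagonal (as RCM tries to avoid) inflates the bandwidth to the full offset.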
How to select the pseudo-peripheral vertex is the focus of the RCM algorithm. Previous experience shows that the first vertex has a significant impact on the effectiveness of bandwidth optimization, and it is advisable to choose a vertex on a longest, deepest path. Because finding a truly peripheral vertex is expensive, a so-called pseudo-peripheral vertex is used instead: the depth of its level structure should be as close to the diameter of the graph as possible. Usually an arbitrary vertex in V is taken as the root node and the corresponding tree structure is computed to obtain the deepest vertex; this vertex is then selected as the new root and the tree is reconstructed until the maximum depth no longer changes, which serves as the convergence condition. The whole process is shown in Algorithm 2. Combined with the CPU-GPU architecture, the sorting and pseudo-peripheral-vertex search steps of RCM are redesigned, and the parallel design of the RCM algorithm is completed using the idea of linear algebra, as shown in Algorithm 3. The algorithm takes the sparse matrix A and the pseudo-peripheral vertex r as inputs and returns the dense vector R representing the RCM ordering. The first step initializes the elements of R to −1; the entry for the i-th vertex is modified after that vertex is visited. L_cur and L_next are the vertex sets of the current and next BFS levels respectively, where L_cur is called the BFS boundary of the currently active vertices. The algorithm first marks the pseudo-peripheral vertex r and inserts it into L_cur. In each loop iteration, the algorithm visits the unvisited neighbours of L_cur, marks them, sorts the vertices on the next boundary in ascending order to obtain the sparse vector L_next, and explores all vertices in a layer-by-layer iterative manner until the boundary L_cur is empty.
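The pseudo-peripheral vertex search of Algorithm 2 can be sketched in Python as follows (the adjacency-list format and the minimum-degree tie-break in the last level are illustrative assumptions, in the spirit of the George-Liu heuristic):

```python
from collections import deque

def bfs_levels(adj, root):
    """Rooted level structure L(root): vertices grouped by distance from root."""
    dist = {root: 0}
    order = deque([root])
    levels = [[root]]
    while order:
        v = order.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                if dist[w] == len(levels):
                    levels.append([])
                levels[dist[w]].append(w)
                order.append(w)
    return levels

def pseudo_peripheral_vertex(adj, start):
    """Rebuild the level structure from the deepest vertex of the previous
    tree until the maximum depth stops growing (Algorithm 2)."""
    v = start
    depth = -1
    while True:
        levels = bfs_levels(adj, v)
        if len(levels) - 1 <= depth:
            return v
        depth = len(levels) - 1
        # pick a minimum-degree vertex in the deepest level (heuristic)
        v = min(levels[-1], key=lambda u: len(adj[u]))
```

On a path graph the procedure walks out to an endpoint, whose eccentricity equals the graph diameter.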
At the beginning of the algorithm loop, the sparse vector L_cur is assigned the vertex labels. Then L_next is obtained by multiplying the adjacency matrix A with the current boundary layer L_cur over the (select2nd, min) semiring. The overloaded multiply operation select2nd passes the parent label to the child, and the overloaded min function ensures that each vertex in L_next becomes the child of the smallest-numbered vertex in L_cur. FIGURE 1 is an example of how to mark the parent vertex number of the next-level vertices in the form of SpMV (sparse matrix-vector multiplication). FIGURE 1a) is a BFS tree with root node a; the current level is the set {b, d}, and the level whose parents are to be calculated is the set {c, e, f}. FIGURE 1b) shows the SpMV corresponding to the tree view; on the right is the output vector, whose element values represent the numbers of the parent vertices. SpMV traverses the sparse vector, finds the values corresponding to the non-zeros in the gray columns, and writes the parent node number at those positions. After the record is completed, the (select2nd, min) functions ensure that any vertex can find its parent vertex number and keep the minimum value. For example, on the right in FIGURE 1, the current-layer node e has two parent nodes numbered 3 and 2; the (select2nd, min) pair ensures that the corresponding position is recorded as 2 in the output. Once the next level is found, the previously marked vertices are removed from it for the next iteration. Thanks to the SpMV formulation, the linear-algebra-based RCM algorithm hides the complexity of the traditional BFS and parallelizes smoothly.
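One (select2nd, min) step as in FIGURE 1 can be sketched in Python over an adjacency list (a serial stand-in for the SpMV; the names and data layout are illustrative):

```python
def spmv_select2nd_min(adj, frontier_rank, visited):
    """One (select2nd, min) SpMV step: for each unvisited neighbour of the
    current frontier, record the smallest label among its frontier parents.
    frontier_rank maps a frontier vertex to its label; the 'multiply'
    select2nd passes the parent label through, and the 'add' min keeps the
    smallest one, as in FIGURE 1."""
    parent = {}
    for v, rank in frontier_rank.items():
        for w in adj[v]:
            if w not in visited:
                parent[w] = min(parent.get(w, rank), rank)
    return parent
```

With frontier {b, d} labeled 3 and 2, a vertex e adjacent to both receives label 2, matching the figure.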
After the next-level vertex set is obtained through the SpMV function, the previously marked vertices have been excluded from L_next. The vertices in L_next are then sorted by the SortPerm function according to the parent vertex number and their own degrees: vertices with a smaller parent vertex number are sorted first, and vertices with the same parent number are sorted by their own degrees. For example, in the figure above, vertices e and f rank first because of their parent vertex number and are placed before c, and e precedes f, so the order of this layer after SortPerm is {e, f, c}. This operation can adopt the idea of bucket sort, which decomposes the extremely time-consuming global sorting task into several small sorting tasks, so CUDA parallel acceleration is easy to realize when the matrix is large enough.
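The bucket-sort view of SortPerm can be sketched as follows (a serial Python stand-in; in CUDA each bucket would be sorted independently):

```python
def sort_perm(next_frontier, parent_rank, degree):
    """Bucket the next BFS level by parent label, then order each bucket by
    vertex degree; concatenating the buckets in parent order gives the
    SortPerm result without one large global sort."""
    buckets = {}
    for v in next_frontier:
        buckets.setdefault(parent_rank[v], []).append(v)
    ordered = []
    for p in sorted(buckets):
        ordered.extend(sorted(buckets[p], key=lambda v: degree[v]))
    return ordered
```

Reproducing the example above, vertices e and f (parent 2) come before c (parent 3), and e precedes f by degree.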
Buluç and Madduri [30] and Azad et al. [24] pointed out in the relevant literature that SpMV over (select2nd, min) and SortPerm are the two most time-consuming steps in the RCM algorithm; since the SpMV libraries provided with CUDA have no API for this semiring, the operation needs to be specially designed for RCM.
The storage formats for large sparse matrices were introduced in the previous chapter. In view of the symmetry of the matrices generated by the finite element model, the CSR and CSC formats coincide, so the popular and computation-friendly CSR format is selected here. There are two ways to parallelize the calculation shown in FIGURE 1. The first assigns each thread a column of the sparse vector: due to the symmetry of the matrix, only the matrix data of the corresponding row needs to be read and written into the global output vector. The per-thread work of this method is relatively simple, but if several threads write to the same position of the output vector, the results will be corrupted; this can be solved either with atomic operations or by launching an additional kernel after each thread completes its calculation to take the minimum over each row (column). The second strategy assigns each thread a row of matrix A: after obtaining a column index, the thread checks whether the corresponding position in the input sparse vector x is non-zero; if so, it calls the min function (or a ternary comparison) once, so that each thread produces one element of the output vector. The disadvantage of this strategy is that for every element of A, the thread must determine whether the corresponding position of x is non-zero; in the unfavourable case each thread performs about log2(nnz(A)) comparison and addressing operations, and the resulting irregular accesses of the threads may congest the memory bandwidth. However, because CSR storage is used, the input sparse vector x must be copied to every machine, while the adjacency matrix A can be split across machines.
After the output vector of each computing node is obtained, the partial results can be gathered on one computing node for combination, which makes this strategy more suitable for the multi-machine, multi-GPU architecture.
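The second (row-per-thread) strategy can be sketched serially in Python; the dict lookup stands in for the non-zero test on the sparse vector x, and each outer-loop iteration corresponds to one CUDA thread:

```python
def csr_spmv_min_rows(indptr, indices, x):
    """Row-per-thread variant of the (select2nd, min) SpMV on a CSR matrix.
    x is the sparse input vector as a dict {column: label}; each 'thread'
    scans one row of A and keeps the minimum label among the columns that
    hit non-zeros of x. Because every row writes its own output element,
    no atomic operations are needed."""
    n = len(indptr) - 1
    y = [None] * n
    for row in range(n):                  # one CUDA thread per row
        best = None
        for k in range(indptr[row], indptr[row + 1]):
            col = indices[k]
            if col in x:                  # stands in for the device-side lookup
                best = x[col] if best is None else min(best, x[col])
        y[row] = best
    return y
```

Rows whose pattern never intersects x stay unmarked (None here), mirroring vertices not reached at this BFS level.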

B. PARALLEL DESIGN OF PREPROCESSING CONJUGATE GRADIENT METHOD BASED ON INCOMPLETE CHOLESKY DECOMPOSITION
Incomplete decomposition is a more general preprocessing method, suitable for diagonally dominant matrices [32]. Here the Cholesky decomposition is applied to the preprocessing equation of the CG method; the standard preprocessing procedure is shown in Algorithm 4. M is a nonsingular matrix approximating A, called the preprocessing matrix. Introducing this matrix changes the spectral properties of the original system, reduces its condition number, clusters the eigenvalues, and improves the convergence speed of the iterative method.
If M = I, the PCG method degenerates to the CG method. The iteration can be terminated when the residual vector r_{j+1} = b − Ax_{j+1} meets the termination criterion of the CG method. From the basic steps, each iteration of the PCG method consists of one solution of a linear system (step 3), two inner products (steps 4, 6), one SpMV (step 6), three vector-triad (axpy) operations (steps 5, 7, 8), and two scalar divisions (steps 4, 6). Compared with the CG method, there is one additional system solve, namely step 3. If the preprocessing matrix M is not selected properly, solving this extra system is itself time-consuming: even if the number of iterations is reduced, when each PCG iteration takes much longer than a CG iteration, the total solution time of PCG is not guaranteed to be less than that of CG. Therefore, at the beginning of the construction of M, it should be clearly realized that this preprocessing technology is meaningful only if it reduces the number of iteration steps without making each step too expensive.
M is constructed by the incomplete Cholesky decomposition of the symmetric positive definite matrix A, that is, A ≈ M = LL^T. However, the exact Cholesky decomposition of A introduces a large number of non-zero elements, making L less sparse than A. Under the premise of preserving the symmetric positive definiteness of L^{−1}AL^{−T}, elements of L at selected positions are forced to zero to preserve sparsity, giving A = LDL^T + R, where R is the error matrix introduced by the incomplete decomposition, L is a unit lower triangular matrix, and D is a diagonal matrix. Proper adjustment puts more of the dropped fill-in into R while keeping LDL^T close to A. If M = LDL^T is selected here, the corresponding preprocessed CG method is called the incomplete Cholesky decomposition conjugate gradient method, abbreviated ICCG (Incomplete Cholesky Conjugate Gradient).
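As a sketch of the idea, an IC(0)-style incomplete factorization restricted to the sparsity pattern of A can be written in Python; for simplicity it computes the LL^T form rather than the LDL^T form used above, and dense storage is for illustration only:

```python
import numpy as np

def icholesky0(A):
    """IC(0): Cholesky factorisation restricted to the sparsity pattern
    of A. Entries of L outside the pattern of A are forced to zero, so
    A ~ L L^T with the error matrix R absorbing the dropped fill-in."""
    n = A.shape[0]
    L = np.tril(A).astype(float)
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        for i in range(k + 1, n):
            if L[i, k] != 0.0:            # keep the pattern of A only
                L[i, k] /= L[k, k]
        for j in range(k + 1, n):
            for i in range(j, n):
                if L[i, j] != 0.0:        # skip positions forced to zero
                    L[i, j] -= L[i, k] * L[j, k]
    return L
```

For matrices whose exact factor has no fill-in (e.g. tridiagonal SPD matrices), IC(0) coincides with the exact Cholesky factor and R = 0.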
The incomplete decomposition is computed only once and can be completed before the iteration begins.
In the dynamic analysis of an elastic vehicle, L and D can be stored in memory in the same storage format as A.
According to Algorithm 4, the PCG parallel algorithm is mainly composed of SpMV, vector dot products, and vector arithmetic. All three kinds of computation can be GPU-accelerated by calling API functions such as cusparseDcsrmv, cublasDdot, and cublasDaxpy [19], [20]. Considering that NVIDIA already provides rich intra-GPU parallel libraries whose efficiency is usually higher than that of hand-written kernel functions, the GPU memory hierarchy and basic algorithms are not introduced here. This paper mainly describes the design of the software from the perspective of the mixed parallel strategy within a node and a highly scalable multi-node communication architecture.

Algorithm 4 PCG Parallel Computing
1. Select x_0 ∈ R^n, compute r_0 = b − Ax_0, solve Mz_0 = r_0, and set p_0 = z_0
2. Do j = 0, 1, 2, ... until convergence
3. Solve the linear system Mz_j = r_j to get the vector z_j
4. Get the correction factor β_{j−1} = (r_j, z_j) / (r_{j−1}, z_{j−1}) (skipped for j = 0)
5. Get the search direction vector for the next iteration p_j = z_j + β_{j−1} p_{j−1}
6. Obtain the correction factor α_j = (r_j, z_j) / (p_j, Ap_j)
7. Get the next iterate x_{j+1} = x_j + α_j p_j
8. Obtain the residual vector r_{j+1} = r_j − α_j Ap_j
9. End Do
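The loop of Algorithm 4 can be mirrored in NumPy as follows; the comments note the cuBLAS/cuSPARSE calls that replace each step on the GPU, and the Jacobi-style M_solve in the usage note is an illustrative preconditioner, not the ICCG factors:

```python
import numpy as np

def pcg(A, b, M_solve, tol=1e-10, max_iter=1000):
    """Preconditioned CG following Algorithm 4. M_solve(r) applies M^{-1}
    (in ICCG this would be two sparse triangular solves with the
    incomplete factors)."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    z = M_solve(r)                 # step 3: solve M z = r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                 # SpMV -> cusparseDcsrmv
        alpha = rz / (p @ Ap)      # inner product -> cublasDdot
        x += alpha * p             # axpy -> cublasDaxpy
        r -= alpha * Ap            # axpy -> cublasDaxpy
        if np.linalg.norm(r) < tol:
            break
        z = M_solve(r)             # step 3 for the next iteration
        rz_new = r @ z
        beta = rz_new / rz         # correction factor beta_j
        p = z + beta * p           # new search direction
        rz = rz_new
    return x
```

A simple usage with a Jacobi (diagonal) preconditioner: `pcg(A, b, lambda r: r / np.diag(A))`.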

III. DESIGN AND IMPLEMENTATION OF MULTI-GPU WORKSTATION CLUSTER ARCHITECTURE
The research background is a small-scale computing cluster that completes the strength check of key components within a limited number of milliseconds and guarantees task decision-making under the condition that the control system does not diverge. From the perspective of computing scale, mainstream cluster frameworks such as Hadoop MapReduce and Spark are not suitable for the real-time, millisecond-level lightweight cluster interaction required in this paper because of their programming language and architectural weight. Being based on Java, their garbage-collected (GC) memory management affects calculation stability and efficiency [33]. Zhou et al. [34] and Kaur et al. [35] have proposed improvement schemes for these frameworks, and Wakde et al. [36] and Hazarika et al. [37] tested the efficiency differences between Spark and Hadoop on different data sets; the research shows that both are competent for traditional counting and iterative problems, but neither suits the millisecond-level strength-check scenario for spacecraft components considered in this paper. The API of MPI is not flexible enough, and a single node failure can easily crash the whole system, which is unacceptable in a space launch mission; its use of communicators to connect process groups can also greatly increase the running time of the algorithm [38]. Pellegrini et al. [39] demonstrated experimentally that a Boost-based design is more flexible and faster in transmission speed; Fukuoka et al. [40] indicated that MPI has performance bottlenecks in multi-threaded applications. In millisecond-scale compute-intensive workloads, the efficiency of OpenMP is much lower than that of CUDA: Guo et al. [41] proved that CUDA achieves a higher speedup than OpenMP in a target-tracking application, and Sivanandan et al.
[42] compared and analyzed the application of MPI, OpenMP, and CUDA in heat-conduction calculations, showing that on a single machine the efficiency of OpenMP is far lower than that of CUDA. Combining the above conclusions, designing an efficient, lightweight, cross-platform cluster communication architecture from CUDA, the Boost thread pool, and Boost.Asio is a reasonable choice. The computing solution of a multi-GPU workstation cluster is generally divided into three levels: inter-node parallelism, intra-node parallelism, and GPU-internal parallelism [43], as shown in FIGURE 2. The topology shows that intra-node communication includes two parts: concurrent execution on the CPUs within a node and parallel execution on the GPUs. To avoid data congestion, inter-node communication uses Boost.Asio on top of the thread pool to complete data transmission between nodes with a fixed number of threads. Based on the multi-GPU cluster framework, this section elaborates the parallel algorithm flow on multiple GPUs, including pseudo-code and multi-GPU communication. The first part is multi-GPU task allocation based on a message bus, describing the allocation model, data division, and message bus design in detail; the second part is the asynchronous communication module based on Asio, describing the connection process of the server and the topological relationships of the components.

A. MULTI-GPU TASK ALLOCATION OF THREAD POOL BASED ON MESSAGE BUS
As the preamble of this section shows, for compute-intensive problems the performance of code executed on the GPU can be much higher than on the CPU when an appropriate parallel algorithm is selected. At the same time, because the memory and number of stream processors of a single GPU are limited, multi-GPU clusters are now used to achieve higher speedups [44]. To ensure generality and extensibility, a message-driven component design is adopted here, and the multithreading engine is realized with a thread pool to guarantee the concurrent execution of tasks on different CPU cores. In the computing node shown in FIGURE 2, all GPUs are called by the engine module (class named Engng). The sequence diagram in FIGURE 3 shows that the software framework used in this paper is mainly composed of three classes: observer, engine, and data communicator. The observer contains the specific algorithm and is called by Engng; in the initialization stage, the observer registers messages with Engng, and after receiving messages, Engng stores them in the message queue. The Engng module acts as a message bus, centrally managing the interaction between the computing class and the data communication class and serving as an intermediary for sending and receiving messages, thereby reducing the coupling between algorithms and modules. This module is the key to asynchronous heterogeneous CPU-GPU parallelism. In the initialization stage it creates the thread pool and partitions the calculation data required by each node; during the calculation, Engng finds and broadcasts the objects related to each message according to the message type. The data communicator class is responsible for managing data communication between computing nodes, called inter-node cross communication; the data communicator of the current node wraps Boost.Asio and is responsible for data interaction between nodes.
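A minimal Python sketch of the message-bus idea follows (class and method names here are illustrative, not the actual Engng interface):

```python
from concurrent.futures import ThreadPoolExecutor
import queue

class Engine:
    """Message-bus sketch: observers register the message types they
    handle; post() queues a message, and dispatch() broadcasts each
    queued message to the registered handlers through a thread pool,
    decoupling the algorithm classes from the communication classes."""
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.handlers = {}
        self.messages = queue.Queue()

    def register(self, msg_type, handler):
        self.handlers.setdefault(msg_type, []).append(handler)

    def post(self, msg_type, payload):
        self.messages.put((msg_type, payload))

    def dispatch(self):
        futures = []
        while not self.messages.empty():
            msg_type, payload = self.messages.get()
            for handler in self.handlers.get(msg_type, []):
                futures.append(self.pool.submit(handler, payload))
        return [f.result() for f in futures]
```

Handlers run concurrently in the pool, which mirrors how Engng pushes observer callbacks into the thread pool for asynchronous execution.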
Once the observer completes the data division required for the calculation on each GPU in the node, it sends a message indicating that the operation has been completed; Engng then puts each corresponding function into the thread pool for asynchronous execution and completes data reduction and message sending in the callback function. After the calculation in one computing node is completed, the data communicator takes the data from the observer and sends it to the master node asynchronously; the send-completion message is raised in the callback function to inform the engine that the data transfer step has finished. In FIGURE 3, the dotted lines represent data input and the solid lines represent data output. The most important steps in FIGURE 3 are the task partitioning and data division on the CPU. Under a hardware configuration of two nodes with dual GPUs per node, SpMV is taken as an example to illustrate multi-GPU task assignment based on asynchronous operations. After the bandwidth optimization of Section II.A, the sparse matrix has become a banded sparse matrix, expressed in block form as shown in FIGURE 4. Computing node 0 is responsible for the first two block rows, i.e. the submatrices A11-A23 are assigned to node 0, and computing node 1 is responsible for the last two block rows; this represents the task-level parallelism between nodes. The two GPUs inside a node are each responsible for one block row, representing the intra-node parallel level. When the matrix bandwidth is small, each GPU only needs to call the SpMV function once, as shown in FIGURE 4a). When the bandwidth is large, further partitioning is needed and the computation is carried out with streams [45], representing the GPU-internal parallel level, as shown in FIGURE 4b).
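The block-row assignment of FIGURE 4 can be sketched as follows (a hypothetical helper; the real code carries data pointers rather than bare indices):

```python
def assign_block_rows(num_block_rows, nodes, gpus_per_node):
    """Deal block rows out node by node, and within a node give each GPU
    one row: this mirrors FIGURE 4's task-level (node) and intra-node
    (GPU) parallel levels for the 2-node, dual-GPU layout."""
    assignment = {}
    for r in range(num_block_rows):
        node = (r // gpus_per_node) % nodes
        gpu = r % gpus_per_node
        assignment[r] = (node, gpu)
    return assignment
```

With 4 block rows, 2 nodes, and 2 GPUs per node, rows 0-1 land on node 0 and rows 2-3 on node 1, one row per GPU, as in the figure.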
Let the number of computing nodes be N and the current node be n, where n < N; let S_n denote the number of streams and i_n the number of GPUs in computing node n. If SpMV is calculated on the n-th node, the s-th stream on the i-th GPU corresponds to the submatrix A(∑_{m=0}^{n−1} i_m + i, s); that is, the GPU calculates all of the data in the block row where the current submatrix is located. The advantage of this scheme is that the sum operation with the Thrust library is easy to implement within a single GPU, but load imbalance occurs when the bandwidth of the original matrix varies greatly, and hardware utilization is hard to guarantee. With the emergence of P2P technology [46], the memories of different GPUs can be connected directly through the PCIe bus, so this paper applies the task structure and adopts a more efficient multi-stream strategy, as shown in Algorithm 5.
Taking as an example a current node n that owns s submatrices, the work can be divided into s streams: the s-th stream runs on device number s % i_n and calculates the submatrix in block column s / i_n. In the first step, the parameters of each device in the computing node are obtained, and the submatrices are divided according to the size of the matrix to form a number of task structures. A task structure includes the host input data pointer, the device data pointer, the stream number, the device number used by the current stream, and the position index of the current block in the original matrix. After the length of the input data in the task structure is obtained, address space is allocated asynchronously in page-locked memory, and the task is delivered to the thread pool.
Steps 5-8 complete the kernel function calculation according to the given stream and device numbers, and the callback function waits for the data to be returned.
Step 10 evaluates the row-wise sum of the vectors using Thrust [47].
For the above strategy, the following mathematical model can be obtained. Assume a total of M streams in each time step, processing data divided into N blocks. In each stage, the calculation time for a unit of data is denoted t_i and the transmission time t_trans-i. If step l takes the longest time in the calculation, denoted t_l, the total execution time t_total is as follows, where j ∈ [0, M] and j ≠ l.
The expression of data processing capacity per unit time is as follows.
Pipeline delay is defined as the time from the moment a data block enters the initial stage until it leaves the final stage. Assuming that the operation time of each stage is the same, denoted t, the previous equations can be simplified as follows.
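The formulas referred to above did not survive extraction; the following is a standard pipeline-timing model consistent with the symbols just defined (M stages, N data blocks, bottleneck step l). It is a hedged reconstruction, not necessarily the paper's exact expressions:

```latex
% Total execution time: the bottleneck step l is traversed N times, while
% every other step contributes once to fill and drain the pipeline.
t_{\mathrm{total}} \;=\; \sum_{\substack{j=0 \\ j \neq l}}^{M}
    \bigl(t_j + t_{\mathrm{trans}\text{-}j}\bigr) \;+\; N\, t_l

% Data processed per unit time (throughput), approaching 1/t_l for large N:
Q \;=\; \frac{N}{t_{\mathrm{total}}}

% With equal stage times t, the model simplifies to
t_{\mathrm{total}} \;=\; (M + N - 1)\, t, \qquad
\text{pipeline delay} \;=\; M\, t .
```

Under this model the pipeline only pays off when N is much larger than M and the stage times are balanced, which matches the conclusion drawn from the experiments with streams.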
The pipelining problem is a typical example of a linear stream. It turns out that general pipeline scheduling problems are poorly suited to CUDA streams; streams should preferably be applied to algorithms in which the duration of each stage is similar and the amount of data is far larger than the pipeline length.
In reference [48], a slender elastic vehicle is simplified to beam elements to solve the vibration problem; the problem is decoupled into independent calculation processes in three directions, realizing the temporal overlap of memory copies and kernel execution, as shown in FIGURE 5. While the first stream executes its kernel function, the second stream transfers data into the GPU. Once the first stream returns data to the host, the second stream starts executing its kernel function and the third stream begins transferring its data in, each waiting for the others to complete. Compared with serial execution on the GPU, the overall performance can be further improved. Since the Kepler architecture, Hyper-Q technology has made it possible for multiple CPU cores or threads to launch tasks on one GPU, which improves GPU utilization through multiple hardware work queues and avoids false dependencies.

B. CLUSTER COMMUNICATION DESIGN AND OPTIMIZATION BASED ON BOOST.ASIO
To ensure the reliability and future extensibility of the strength-checking framework in rocket flight missions, the communication module of the software framework in this paper should have the following characteristics. The data communication thread in a node must not affect the stability of the main thread, and real-time computation must not be blocked by communication, so as to guarantee the computational stability of the node itself. To support subsequent strength checks of larger components, the module should be highly extensible. Because it must communicate with multiple nodes and other file servers, it should also have high communication efficiency. To meet these requirements, this paper combines Boost.Asio with Boost.Threadpool. Asio is a cross-platform library that runs on most operating systems and can support thousands of concurrent connections simultaneously. It provides a set of APIs supporting TCP, UDP and ICMP sockets, and can be extended at any time to support additional protocols. In the communication module flowchart shown in FIGURE 6, the server maintains a message bus containing the shared_ptr of every TCP/UDP/serial instance. The server listens asynchronously on a port, keeps listening after a client connects, and creates a session class to send and receive data. Other components send a notification to call the function Read() and at the same time call the run() function of io_context to start the event loop. The deadline of the asynchronous read operation is set to one time step: if no data can be received within a time step, the callback function closes the communication class instance and deletes the client from the list. If the receive operation does not time out, the relevant operation is carried out, and the io_context waits for the next message to be read.
The logic of the write operation is similar to that of the read operation, except that before calling the asynchronous write function it must be checked whether the output queue has been emptied; if not, the previously queued data must be sent before the new data is transmitted.
A basic message framework is given here, as shown in FIGURE 7, where dashed lines represent derivation relationships and solid lines represent associations such as invocation or instantiation. As the figure shows, BaseEngng and BaseSession are both atomic components from which different types of entity components are derived. BaseEngng derives the communication classes ServerEngng and CommonEngng, each of which maintains its own io_context instance and thread pool instance. BaseSession derives the various communication class instances and realizes polymorphism through a common function interface.
In the algorithm design, the calculation of the element matrices and the parallel solution of the linear equations are a series of steps repeated over a data set. When cluster computing is used, the data should be split along one or more dimensions and distributed to different computing nodes. If there were no dependency between the data, this would be a simple data-segmentation problem that maximizes the potential speedup ratio. Unfortunately, there is data dependency among the computing nodes when solving the linear equations, and the data stored on the nodes must be exchanged at each calculation step. Referring to the characteristics of the existing TCP/IP network and of GPU memory bandwidth, the implementation platform is homogeneous with a single-port, full-duplex communication connection, and the problem parameters are described as follows: N is the number of cells in each dimension; k is the number of cells in each group along each decomposed dimension.
P is the number of computing nodes; t_comp is the cost of one update of a cell; t_comm is the communication cost between two computing nodes; t_start is the delay caused by starting a communication.
Assuming that in the one-dimensional decomposition the domain is divided into a total of P = N/k groups and that N is divisible by k, the execution and communication costs of a single node at each time step can be written down. In the two-dimensional decomposition each node processes k^2 cells, with P = N^2/k^2, and the per-step execution and communication costs follow analogously. Similarly, the execution and communication cost of each node processing k^3 cells under the three-dimensional decomposition can be derived. From the relationship between the calculation and communication costs in each dimension, the optimal threshold for the decomposition dimension can be obtained: at the beginning of the calculation, N, P and k are known, so once the communication and calculation times are measured, the optimal geometric decomposition can be determined.
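The cost expressions referred to above were lost in extraction; they can be sketched with the standard surface-to-volume argument. This is a reconstruction assuming each node exchanges one face of its block with each of its 2d neighbours; the exact constants in the paper may differ:

```latex
% Per-time-step cost of a d-dimensional decomposition, d = 1, 2, 3:
% a block of side k holds k^d cells to update and exposes a face of
% k^{d-1} cells to each of its 2d neighbours.
T_d \;=\; \underbrace{k^{d}\, t_{\mathrm{comp}}}_{\text{execution}}
      \;+\; \underbrace{2d\,\bigl(t_{\mathrm{start}}
            + k^{d-1}\, t_{\mathrm{comm}}\bigr)}_{\text{communication}},
\qquad P = N^{d}/k^{d}.
```

Comparing T_d across dimensions at equal cell counts per node yields the threshold on k: the higher-dimensional decomposition wins when the saving in face traffic outweighs the extra start-up latencies, i.e. roughly when k^{d-1} t_comm dominates t_start.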

IV. NUMERICAL EXAMPLES ANALYSIS
A. TEST PLATFORM AND TEST SUITE
The purpose of this article is to realize the rapid calculation of the strength of aircraft structural components, providing a reference for subsequent flight mission decisions and the reconstruction of guidance instructions. For the components that affect flight attitude, the digital twin system must ensure that the aircraft system does not diverge during the time needed for the strength check and the subsequent decision-making. The guidance and control system of the aircraft may diverge within a few control periods; to leave sufficient time for subsequent guidance reconstruction and mission decisions, one guidance cycle, that is, 25 ms, is selected here as the real-time criterion.
The experimental platform is a computing cluster composed of 2 nodes, each of which includes an Intel E3 1230 V5 at 3.4 GHz and two GeForce GTX 1060 graphics cards. Due to funding constraints, Tesla GPUs with higher double-precision computing capability will be used in future tests. To illustrate the impact of bandwidth reduction on the speed of solving sparse linear equations and to ensure generality, two components from engineering applications (C3D8 and 2D3N) and several sparse matrices from the sparse matrix collection [50] are selected as test sets to verify the parallel RCM algorithm and the parallel linear equation solver of the next section. The attributes of the selected test matrices are shown in TABLE 1. C3D8 denotes a model consisting of three-dimensional eight-node hexahedral elements, a load-bearing component in the body; the simplified thermal protection component 2D3N is composed of two-dimensional three-node elements. It should be noted that neither of these element types can accurately simulate nonlinear problems; however, the main work of this article is the real-time strength check based on CUDA acceleration, so the focus is on completing the algorithm design and performance testing first. The initial work for the two models, from element division to assembly of the total stiffness matrix, is completed in ABAQUS and output through an .mtx file.
The above test cases include the thin shell and truss structures of a slender aircraft, a simplified slender beam, truss structures applicable to complex aircraft interiors, and thin plate structures similar to a simplified wing. Some typical structures are shown in FIGURE 8.

B. TEST RESULTS AND ANALYSIS
The experiment first validates the parallel RCM algorithm based on linear algebra and demonstrates its speed. Next, the overall solution times of different matrices are compared, and the components of the solution time are analyzed to illustrate the speed of the parallel algorithm. Finally, the correctness of the algorithm is verified.
Taking RCM and the approximate minimum degree (AMD) algorithm as examples, Cholesky decomposition is adopted for positive definite matrices and QR decomposition for general matrices. The CPU time after bandwidth optimization is compared in TABLE 2, from which it can be concluded that bandwidth reduction is very effective in reducing the solution time of linear equations. Among the cases, matrix No. 12 has fewer rows than matrix No. 13 but a longer solution time, mainly because its sparsity is lower. In addition, the RCM algorithm is ineffective in accelerating the solution in several examples, because those matrices have a massive number of rows combined with poor sparsity. A schematic of part of the bandwidth reduction is shown in FIGURE 9: the first column shows the original element distribution, the second column the distribution after RCM optimization, and the third column the distribution after AMD optimization [25], [26], [51].
Considering that the GPU-based parallel algorithm has no good acceleration effect on small-scale matrices, some test cases from TABLE 2 that are close to the structure of aircraft parts are used to compare CPU and GPU running times, as shown in FIGURE 10.
It can be seen from the figure that for matrices such as bodyy4 and nasa4704, which have small dimensions and a small average bandwidth, the GPU-based RCM parallel algorithm designed in this paper is not appropriate and even takes more time than the CPU version: the data interaction between CPU and GPU requires extra time, and not enough threads can be allocated to hide the overhead of data transmission. For larger examples such as nd24k, the RCM algorithm designed in this paper has great advantages, with speedup ratios ranging from 7 to 14. In terms of the composition of the computation time, the Qsort and SpMV operations take up most of the time of the RCM algorithm, and the GPU accelerates both well. If a three-dimensional node has six degrees of freedom, then, considering the time consumed by the other calculation steps, the current acceleration can sustain real-time calculation of an aircraft finite element model with about 2000 nodes; beyond this scale, real-time calculation with a 25 ms step cannot be guaranteed. Even so, the parallel bandwidth optimization algorithm can still be applied to fault diagnosis of aircraft parts within a limited time, saving considerable time for subsequent calculations.
To demonstrate the performance of the algorithm and the architecture, the examples in TABLE 2 are used for a time comparison. The serial program uses the Lapack algorithm package and the Eigen library for contrast. After profiling various linear equation solvers with several parallel libraries, the algorithm with the lowest CPU time is selected as the benchmark. The biconjugate gradient stabilized algorithm is excluded from the comparison with other parallel iterative methods because of its large per-step computation and low efficiency on symmetric positive definite matrices. Part of the test results are shown in FIGURE 11-FIGURE 12; differences in bandwidth-optimization preprocessing time are due to the use of different bandwidth optimization algorithms.
The test results show that when the number of matrix rows is less than 1000, the CPU computing time is within 10 ms and the CUDA-based conjugate gradient parallel algorithm has no advantage; for small matrices such as bcsstk03 in particular, the efficiency of the parallel algorithm is much lower than that of the CPU because of data transmission. However, when the number of matrix rows reaches 4000, the parallel algorithm gradually begins to show its advantage, as FIGURE 7 shows; for example, the speedup reaches 3 when solving sts4098, bodyy4 and some other examples. As the matrix dimension increases, the share of the matrix analysis and decomposition steps in the total preprocessing time grows while the share of the iterative solution time shrinks. The reason is that LU and Cholesky decompositions are path dependent: the next step can be computed only after the previous row or column has been completed. The number of available threads also changes with the bandwidth, and the time consumption rises sharply as the matrix dimension grows. In addition, the timings of the two parallel preprocessing algorithms show that preprocessing effectively reduces the time needed to solve the linear equations; the larger the system of equations, the more obvious the acceleration.
As can be seen from FIGURE 13, the preconditioning technique effectively reduces the initial residual, significantly reduces the number of iterations, and is less prone to divergence. The difference between the examples of FIGURE 13b) and FIGURE 13d) is obvious: TABLE 2 shows that the condition numbers of both matrices are relatively large, so they are sensitive to truncation error during the iteration. The preconditioning, however, improves the matrix properties and the computational efficiency by reducing the condition number.

V. CONCLUSION AND FUTURE WORK
In this paper, a parallel bandwidth optimization for large sparse linear systems based on the RCM algorithm and a parallel preconditioned conjugate gradient iteration method are proposed. A data segmentation model is established on a multi-GPU cluster, and timing tests are carried out with examples from the sparse matrix library. The specific contents are as follows. The parallel RCM algorithm based on graph theory is analyzed and designed: using BFS, the matrix is mapped to an undirected vertex graph, and the depth of the tree graph is changed to reduce the bandwidth; a custom SpMV algorithm realizes fast parent-node marking. Examples from the University of Florida sparse matrix collection that are close to aircraft parts are used to analyze the composition of the parallel bandwidth optimization algorithm and to verify its speed.
For the current mainstream solution algorithms for large linear systems, the parallel iterative ICCG algorithm is implemented, and the computation per iteration is analyzed. By comparing the test set against traditional CPU parallel solver libraries, the time consumption of each calculation step for different matrix dimensions is analyzed in detail. The results show that the preconditioning technique can effectively reduce the number of iterations and the solution time for conventional structural components. The asynchronous parallel architecture of multi-GPU clusters is designed, and structural examples from the University of Florida sparse matrix collection prove that the design method of this paper can be applied to the fast calculation of the structural strength of aircraft parts.