An Out of Memory tSVD for Big-Data Factorization

Singular value decomposition (SVD) is a matrix factorization method widely used for dimension reduction, data analytics, information retrieval, and unsupervised learning. For most big-data applications, only the singular values of the SVD are needed. Methods such as tensor networks, however, require an accurate computation of a substantial number of singular vectors, which can be accomplished through truncated SVD (tSVD). Additionally, many real-world datasets are too big to fit into the available memory, which mandates the development of out of memory algorithms that assume that most of the data resides on an external disk during the entire computation. These algorithms reduce communication with the disk and hide part of the communication by overlapping it with computation on blocks of work. Here, building upon previous works on SVD for dense matrices, we present a method for computing a predetermined number, $K$, of singular vectors, and the corresponding $K$ singular values, of a matrix that cannot fit in memory. Our out of memory tSVD can be used in tensor network algorithms. We describe ways to reduce the communication during the computation of the left and right reflectors needed to compute the singular vectors, and introduce a method for estimating the block sizes needed to hide the communication on parallel file systems.


I. INTRODUCTION
Large amounts of high-dimensional data are constantly generated from sensor networks, large-scale experiments, massive computer simulations, electronic communications, social-network activities, and many other sources [1]. Most of the collected big datasets (such as imagery, spatiotemporal sensor data, etc.) are naturally high-dimensional, i.e., they are tensors [5]. High-dimensional datasets typically represent multiple concurrent latent processes (explosions, earthquakes, etc.) imprinting their signatures in the observable variables (temperature, pressure, density, etc.) in different dimensions (space/time). The different manifestations of these latent processes allow a unique representation of the data. This representation requires unsupervised machine learning by factorizing the multi-dimensional (tensor) dataset into factor matrices and a core tensor [18]. Well-known classical tensor factorizations (TFs) are the Canonical Polyadic Decomposition (CPD) [15] and the Tucker Decomposition (TD) [29]. In CPD and TD, the factor matrices usually carry the wanted latent processes and structures in each tensor dimension. For tensors of a higher order d (d = 10, 100, . . .), the classical TFs are not feasible, since the memory and the number of operations grow exponentially with the order d. In this case, tensor networks provide a natural representation of the data and make the factorization possible [8].
(The associate editor coordinating the review of this manuscript and approving it for publication was Tomás F. Pena.)
One of the simplest tensor networks is the Tensor Train (TT) [22], Figure 1a. The advantage of the TT compressed format over the classical TFs is the ability of TT to provide stable rank reduction, achieved through a process of consecutive tSVDs of several large matrices, Figure 1b. The optimization problem of TT often involves dense matrices with billions of entries. With tSVD, one has to extract a substantial number of singular vectors for better reconstruction [22]. Along with TT, most of the other tensor networks use a set of sequential tSVDs [7]; therefore, tSVD algorithms that are capable of operating on extra-large matrices are needed. Thus, out of memory tSVD algorithms that can compute a predetermined number of singular vectors of matrices that cannot fit in the memory are important for TT decomposition of tensors that cannot fit into the available memory. Here, we propose an out of memory tSVD algorithm suitable for tensor network algorithms. Our approach relies on the existing works in [3], [17], [21], which use a two-phase approach to construct a tSVD suitable for dense matrices. The first phase reduces the input matrix A into a product Q_L M Q_R, where Q_L and Q_R are orthonormal and M is band diagonal. In the second phase, the band diagonal matrix M is reduced to a product Ũ S Ṽ, where S is a diagonal matrix of singular values and Ũ and Ṽ are orthonormal. Finally, the singular vectors are computed by taking the products Q_L Ũ = U and Ṽ Q_R = V.

A. OUR CONTRIBUTIONS
We implement an out of memory tSVD algorithm that computes both the singular values and the singular vectors, which can be used in out of memory tensor network algorithms. Our main contributions are as follows.
• We implement an algorithm to compute the left and right reflectors of the reduction to band diagonal phase that are needed to compute the singular vectors in out-of-core fashion as opposed to just computing the singular values as in [17].
• We develop caching strategies to reduce the number of input/output (I/O) operations specific to the access patterns of the left and right reflectors, thereby increasing the efficiency of the implementation.
• Our implementation uses a task based model in a shared memory machine. We analyze the scalability of the implementation with respect to the number of threads.
• We analyze the effects of parallel I/O on the overlap of communication and computation.

II. BACKGROUND
Approaches for singular value decomposition typically follow either a two-phase or a three-phase scheme [3], [17], [21]. The two-phase approach implemented in LAPACK [3] first reduces a general dense matrix to bidiagonal form by repeatedly applying Householder transformations [16] to annihilate elements either to the right of the diagonal or below the subdiagonal. This method generates, for an input matrix A, a bidiagonal matrix B and left and right reflectors Q_L and Q_R such that A = Q_L B Q_R. Singular values are then obtained by reducing B to diagonal form and computing reflectors Ũ and Ṽ such that A = Q_L Ũ S Ṽ Q_R = U S V. While the LAPACK implementation does compute the singular vectors of A, it is not capable of efficient out of memory computation, as it requires O(N^3) word movements [17]. Out of memory algorithms have been developed for one-sided factorizations [9], [11], [28], including QR factorization. The work in [25] performs an out of memory SVD on tall-and-skinny matrices by first performing an out of memory QR such that A = QR. This is followed by an in-memory SVD on the R matrix, which has the same singular values as A. Another out of memory attempt in [17] uses a three-stage algorithm to compute singular values. The first stage is a general out of memory dense matrix reduction to band diagonal form using a tile algorithm. The second reduces the band to bidiagonal form, and the third reduces the bidiagonal to diagonal form, yielding the singular values. The two-stage reduction to bidiagonal form has been used for SVD in [13], [20], [21]. Demchik et al. [10] design an out of memory randomized SVD algorithm for both dense and sparse matrices that uses a randomized matrix approximation technique to reduce the size of the input matrix before computing an SVD. Heavner et al. [14] address the related problem of a rank-revealing factorization as an alternative to tSVD for matrices too large to fit in RAM.
They describe an algorithm for a UTV factorization (introduced in [24]) based on randomization.
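The tall-and-skinny approach of [25] mentioned above is easy to sketch. The following Python/NumPy snippet (an in-memory illustration under assumed sizes, not the out of memory implementation) shows why it works: the R factor of A = QR carries the same singular values as A, and one multiplication with Q recovers the left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 50))   # tall-and-skinny matrix

# Phase 1: QR factorization (performed out of memory in [25]).
Q, R = np.linalg.qr(A)                 # R is only 50 x 50

# Phase 2: in-memory SVD of the small R factor.
Ur, S, Vt = np.linalg.svd(R)

# R has the same singular values as A, and U = Q @ Ur recovers the
# left singular vectors of A.
assert np.allclose(S, np.linalg.svd(A, compute_uv=False))
U = Q @ Ur
assert np.allclose(A, U @ np.diag(S) @ Vt)
```
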
The authors in [17] suggest three strategies to either reduce communication with the disk or overlap communication with computation in the first stage, which consists of intertwined QR and LQ phases. The first of these strategies is to cache tiles from the lower right-hand side of the matrix, since these are the most used. The second strategy is to store the reflectors q, which are used in the update of the trailing matrix, in memory, since they are read once per update. The last strategy overlaps communication with computation by reading the next tile, writing the previous tile, and computing on the current one at the same time. The authors derive the formula b ≥ 3.2α/β for the optimal tile size to hide the cost of communication, where b is the tile size, α is the computational performance of the machine, and β is the speed of the external storage.
The reduction to band form is followed by a reduction to diagonal form, which uses the bulge chasing technique found in [4]. The work in [21] uses the same two-step process without any of the strategies above; it uses a task-based model for reducing a band matrix to bidiagonal form. Nevertheless, none of the attempts in the literature calculates the singular vectors. In this paper, we address this computation and the consequent memory-related problems in our out of memory tSVD implementation.

III. ALGORITHM
Our method consists of two phases. The first phase reduces a dense matrix A with tile size b to band form using the same tile scheme as in [17]. This is followed, in the second phase, by a sparse tSVD on the band matrix. Using this approach, the first phase reduces communication when compared to the single-phase bidiagonal reduction approach implemented in LAPACK. Unlike [17], we also compute the left and right reflectors needed to compute the singular vectors. After the first phase, the entire matrix consists of zeros except for a diagonal band of elements of width b. In total, there are approximately N·b non-zero elements. With a small enough b, the entire matrix can fit in memory as a sparse matrix. We create an in-memory sparse matrix by reading the band tiles from disk and writing the non-zero entries to memory. This allows the use of sparse tSVD solvers that operate in memory. We note that in the rest of the paper we refer to columns of tiles and rows of tiles as simply columns and rows, respectively.
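As a concrete check of these counts, the sketch below (Python with NumPy/SciPy; the sizes N and b are illustrative, not taken from the paper's experiments) builds a band matrix of width b and stores it as a sparse matrix:

```python
import numpy as np
from scipy import sparse

N, b = 2000, 16   # illustrative matrix size and band width

# After the first phase, only a band of width b above the diagonal
# survives: entries A[i, j] with i <= j <= i + b - 1 are non-zero.
rng = np.random.default_rng(1)
band = np.triu(np.tril(rng.standard_normal((N, N)), b - 1))

# Read the band tiles and keep only the non-zero entries in memory.
M = sparse.csr_matrix(band)

print(M.nnz)            # approximately N*b non-zero elements
print(M.nnz / N**2)     # density approximately b/N
```
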
A. FIRST STAGE: REDUCE TO BAND DIAGONAL
Algorithm 1 shows the procedure for transforming an input matrix, A, into band diagonal form. All of the functions used are available in LAPACK. As in [17], we store every matrix in column-major tile format. The tile size, b, can be chosen to minimize the runtime of the algorithm. For an N × N matrix with tile size b, the total number of tiles is u², where u = N/b. We first update the dense matrix A in a sequence of intertwined QR and LQ phases which zero out portions of the matrix, while also updating the matrices Q_L and Q_R, defined as the left and right reflectors that transform A into band diagonal form. Matrices Q_L and Q_R are initialized as identity matrices.
The first of the QR functions is DGEQRT. We use it to transform the diagonal tile, A[i, i], into upper triangular form and to obtain Q, which we use to update the tiles to the right of A[i, i]. No updates are needed for the tiles to the left, since these tiles are already zero from the previous QR phases. Since our goal is to compute the left reflector Q_L, we also apply Q to the corresponding tiles of Q_L. Similarly, the algorithm computes a QR factorization on pairs of tiles using DTPQRT, which takes as input the diagonal tile A[i, i] and a tile below it, A[j, i], and returns an upper triangular matrix, where A[i, i] becomes upper triangular and A[j, i] is zeroed out. This is followed by updates to the tiles in the same rows as A[i, i] and A[j, i], as shown in Figures 2a and 2b. Tiles of Q_L in columns i and j are updated as well, as shown in Figures 3a and 3b. The LQ factorization follows a similar pattern, producing lower triangular tiles rather than upper triangular ones, Figures 2c and 3c. The QR and LQ phases are intertwined to produce the band shape.
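The effect of one such pairwise step can be mimicked with a generic QR on a stacked tile pair. This sketch uses NumPy's qr in place of the structured LAPACK kernel DTPQRT (which additionally exploits the triangular shape of the top tile); the tile size is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
b = 4
A_ii = rng.standard_normal((b, b))   # diagonal tile
A_ji = rng.standard_normal((b, b))   # tile below it

# QR on the stacked pair zeroes the lower tile and leaves an
# upper-triangular block in place of A[i, i].
stacked = np.vstack([A_ii, A_ji])
Q, R = np.linalg.qr(stacked)         # Q is 2b x b, R is b x b

assert np.allclose(R, np.triu(R))    # upper triangular result
assert np.allclose(Q @ R, stacked)   # factorization is exact
# Q (or its compact V, T representation in LAPACK) is then applied to
# update the remaining tiles in rows i and j, and the tiles of Q_L.
```
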

B. SECOND STAGE: SPARSE SVD
In the second stage, we convert the band matrix into a sparse matrix with density approximately equal to b/N and a number of non-zero elements approximately equal to N·b. In this stage, we assume that N·b is small enough to fit in memory. The sparse tSVD returns the top K singular values and vectors in Ũ, S, and Ṽ such that M ≈ Ũ S Ṽ. Therefore, A ≈ U S V, where U = Q_L Ũ and V = Ṽ Q_R, both of which require a single matrix multiplication to compute. Since Q_L and Q_R can both exceed the main memory capacity, the algorithm computes each product by reading in tiles of Q_L or Q_R and multiplying them by tiles of the in-memory matrix Ũ or Ṽ.
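The second stage can be sketched with SciPy, whose svds routine also wraps ARPACK (the paper uses ARPACK through Armadillo). In this toy version Q_L and Q_R are identities, so the recovery step U = Q_L Ũ, V = Ṽ Q_R is trivial; in the real algorithm it is performed tile by tile. The sizes are illustrative:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

N, b, K = 1000, 8, 5
rng = np.random.default_rng(3)

# Band matrix from the first phase, held in memory as a sparse matrix.
M = sparse.csr_matrix(np.triu(np.tril(rng.standard_normal((N, N)), b - 1)))

# Truncated SVD: top K singular triplets of the band matrix.
U_t, S, Vt_t = svds(M, k=K)                 # ARPACK under the hood

# svds returns singular values in ascending order; sort descending.
order = np.argsort(S)[::-1]
S, U_t, Vt_t = S[order], U_t[:, order], Vt_t[order]

# Each triplet satisfies M v_i = s_i u_i.
assert np.allclose(M @ Vt_t.T, U_t * S, atol=1e-6)
```
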

IV. IMPLEMENTATION

A. TILE CACHING
In this section, we describe caching strategies which reduce the communication cost. The authors of [17] cache the most accessed tiles, in the lower right-hand side of the matrix, to reduce communication. We adopt a similar strategy for caching tiles of the matrices Q_L and Q_R, where we store the leftmost columns and the topmost rows, respectively. Since these matrices access tiles differently than matrix A, each requires a different caching pattern. Q_L accesses all tiles an equal number of times throughout the decomposition, u times in total, and accesses tiles column by column, as shown in Figure 3. Q_R is similar, except that tiles are accessed row by row and the topmost row is never accessed, because the LQ factorization is not applied to tile A[1, 1] but rather to A[1, 2].
For a square matrix, Q_L is of dimension N × N, or u × u tiles. Each tile is read u times and written u times, for a total of 2u³ disk accesses. If we can store p columns of this matrix, we only have to access the disk 2u(u − p) times during the first QR sweep, as shown in Figure 4a. After the first update phase, column 1 is never accessed again, which means the memory used to store column 1 can store column p + 1. In the next iteration, the external memory is accessed 2u(u − p − 1) times, as shown in Figure 4b. In total, if p is at least 1, the number of disk accesses is reduced accordingly, with a 2u² term still required to read the tiles into memory and write them back out to disk. Note that caching the first column provides the most gains, as this caches the one column that is used in the update of the rest of the tiles. Similarly, we can store p rows of Q_R and reduce its disk accesses in the same way.

We can further reduce reads and writes to disk by caching the top row of the band matrix during the QR phase and the leftmost column during the LQ phase. As seen in Figures 2a and 2b, the topmost row is used in the update of all tiles below it. Figure 5 shows that the top row is read u times and written u times for a single QR sweep, whereas the tiles below are only read and written once. We can cache the tiles in this row during the single-tile updates using DGEMQRT and DGEMLQT and keep them in memory during the two-tile updates using DTPMQRT and DTPMLQT. Since DTPMQRT and DTPMLQT require two tiles, one of which is used multiple times in a single update of the trailing matrix, caching the one that is reused reduces the I/O by half. Lastly, since the two-tile updates make up the bulk of the operations, this translates to approximately a 50% reduction in I/O during the entire SVD. To simplify the presentation, and since the LQ and QR phases are broadly identical, the next sections refer only to the QR phase.

FIGURE 5. Amount of memory reads and writes for a single QR sweep. Each tile below the first row is updated once and so must be read and written once. Each tile in the first row is updated u times: once alone and u − 1 times concurrently with each of the u − 1 tiles below it.
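The counting argument above can be written out explicitly. The totals in the following sketch are reconstructed from the per-sweep counts given in the text (the closed-form expression is elided in this copy), so treat them as illustrative bookkeeping rather than measured I/O:

```python
def q_disk_accesses(u, p):
    """Reads + writes to disk for the u x u tiles of Q_L when p
    tile-columns can be cached. With no cache, every tile is read and
    written u times, for 2*u**3 accesses in total."""
    if p == 0:
        return 2 * u**3
    total = 2 * u**2                  # read each tile once, write it once
    for sweep in range(u):            # uncached columns touched per sweep
        total += 2 * u * max(u - p - sweep, 0)
    return total

u = 100
print(q_disk_accesses(u, 0))          # 2u^3 = 2,000,000
print(q_disk_accesses(u, 10))        # caching 10 columns: 839,000
```

Caching even one column already removes the largest term, in line with the observation that the first column is the one reused in the update of all remaining tiles.
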

B. DEPENDENCIES
We implement our model using task-based parallelism. In our experiments, this showed better performance than a single multi-threaded task model. We ensure that the algorithm is executed in a valid order by tying tasks to dependencies. These include data dependencies, ensuring that two tasks never access the same data, and algorithmic dependencies, where one task must always occur before the other. For example, the matrix multiplication updates must always come after a QR factorization task. Furthermore, since every task in a QR sweep uses the top row, as shown in Figure 2, no two tasks can update the same column.
In our model, every QR decomposition (blue nodes in Figure 6) creates update tasks which must be performed in order so that overlap of computation and communication can occur. Once an update occurs (red nodes in Figure 6), the next update task can proceed (nodes to the right of red nodes). In the standard algorithm, update tasks in the same row are independent. However, since each red node also reads data for the next task, these tasks must be performed sequentially within a row. That is, the update of tile A[i, k] depends on tile A[i, k − 1]. Each red node also allows tasks in the same column to initiate, because tasks in the same column use the same topmost tile. In practice, each update task contains subtasks for reading the next tile, updating the current tile, and writing the previous one. The I/O subtasks are submitted to an I/O thread pool. Furthermore, a lock must be used for the tiles to the left and right of the tile that is being modified, because those tiles are being read or written in parallel with the computation. Since the topmost row is shared between tasks, if a task is operating on three tiles in the same row, no other task can operate on tiles that are in a different row but in the same columns as these tiles. Figure 7a shows what these tasks look like.
To increase the parallelism, in our model we assume that the top row and the leftmost column are held in memory, so reads and writes to them are fast compared to disk I/O. These tiles are the common tiles shared by all update tasks in a QR/LQ sweep and should be released quickly. To do this, each I/O subtask only reads and writes tiles that are not in the top row. During the compute subtask, the topmost tile necessary for the update is read, computed on, and written; that is, all the subtasks are combined into a single task for the shared tiles only. After this, its lock is released, even when the I/O subtasks might not yet have completed. Figure 7b shows that this method increases the number of compute tasks that can be performed in parallel. We use this scheme for the final implementation.
We use the task construct in OpenMP to build a DAG representing the dependencies between tasks. OpenMP then schedules each task on a thread according to these dependencies.
The DAG does not serialize the whole computation. There are sections (e.g., the updates within a row) which must be performed sequentially, but other threads can, and do, work concurrently on their own rows.
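The intra-row ordering rules can be encoded as a small dependency graph. This sketch (using Python's standard graphlib in place of the OpenMP task DAG) builds the dependencies for one QR sweep and checks that a valid schedule exists:

```python
from graphlib import TopologicalSorter   # Python 3.9+ standard library

def qr_sweep_order(i, u):
    """Dependency edges for the update tasks of one QR sweep; an
    illustrative encoding of the rules in the text, not the full DAG
    of Figure 6."""
    ts = TopologicalSorter()
    ts.add(("factor", i, i))                       # DGEQRT on A[i, i]
    for k in range(i + 1, u):
        # The update of A[i, k] depends on the factorization and,
        # because each task also prefetches the next tile, on the
        # update of A[i, k - 1].
        deps = [("factor", i, i)]
        if k > i + 1:
            deps.append(("update", i, k - 1))
        ts.add(("update", i, k), *deps)
    return list(ts.static_order())    # raises CycleError if invalid

order = qr_sweep_order(0, 5)
print(order)   # factorization first, then updates left to right
```
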
We experimentally validated the correctness and accuracy of the algorithm by running it on multiple instances of matrices with known singular values. Our tests established that the computed singular values match the known values to within a tolerance ε, and that U^T U and V V^T are within ε of the identity, where the last two conditions ensure that the singular vectors are orthonormal. We set ε to 10^−5.
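Checks of this kind can be coded directly (a sketch; the exact residuals checked in the original are elided in this copy, so the criteria below are a generic reconstruction):

```python
import numpy as np

def check_tsvd(A, U, S, Vt, eps=1e-5):
    """Validate a truncated SVD: triplet residuals A v_i = s_i u_i
    (relative to ||A||) and orthonormality of the singular vectors."""
    ok_triplets = np.linalg.norm(A @ Vt.T - U * S) <= eps * np.linalg.norm(A)
    ok_left = np.linalg.norm(U.T @ U - np.eye(U.shape[1])) <= eps
    ok_right = np.linalg.norm(Vt @ Vt.T - np.eye(Vt.shape[0])) <= eps
    return ok_triplets and ok_left and ok_right

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 200))
U, S, Vt = np.linalg.svd(A)
assert check_tsvd(A, U[:, :10], S[:10], Vt[:10])   # top-10 truncation passes
```
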

C. I/O OVERLAP
Each task from the section above can include a read, a write, or both. All tasks send their reads and writes to a collection of I/O threads that perform the communication. This allows the threads to overlap the computation of a tile with the reads and writes of adjacent tiles. The model for I/O overlap is the same as in [17], so we can use their estimate b ≥ 3.2α/β. We use the parallel file system to improve the speed of the algorithm. Specifically, we show that this reduces the optimal tile size needed to overlap communication with computation to b ≥ 3.2α/(T_IO β). Here, β is the memory bandwidth per I/O thread and T_IO is the number of threads devoted to I/O. In cases where I/O is the bottleneck, sacrificing computational threads for I/O threads results in better performance.
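The overlap model can be illustrated with a toy simulation, where time.sleep stands in for both the tile computation and the disk transfer; the thread counts and times are illustrative, not measurements:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sweep(n_tiles, t_compute, t_io, io_threads):
    """Toy model of the overlap scheme: while the worker computes on
    the current tile, the transfer of the next tile runs on an I/O
    thread pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=io_threads) as io_pool:
        pending = None
        for _ in range(n_tiles):
            nxt = io_pool.submit(time.sleep, t_io)  # prefetch next tile
            time.sleep(t_compute)                   # compute current tile
            if pending is not None:
                pending.result()                    # prior transfer done
            pending = nxt
        pending.result()
    return time.perf_counter() - start

# With t_compute >= t_io the transfers are hidden: the sweep takes about
# n_tiles * t_compute instead of n_tiles * (t_compute + t_io).
t = sweep(n_tiles=10, t_compute=0.02, t_io=0.01, io_threads=2)
```
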

D. SECOND STAGE: SPARSE SVD
To compute the truncated SVD of the band diagonal matrix, we employ ARPACK [19] through the C++ Armadillo library [26]. This uses implicitly restarted Lanczos iterations to efficiently construct the Krylov subspaces of our sparse band matrix. From this, it computes the tSVD, providing the K largest singular values and vectors.

E. MEMORY REQUIREMENTS
The first phase of the algorithm requires enough memory to store the shared tiles of A (the top row in the case of the QR phase) and the shared tiles of either Q_L or Q_R. In addition, each thread needs space to store the matrix Q, the factors needed to update the rest of the tiles (given by LAPACK as V and T), and the placeholder tiles for the tiles that are read, written, and computed on. V and T are used to compute the product of Householder reflectors H_1 H_2 . . . H_b = H = I − V T V^T, which is utilized for the QR decomposition [3]. Q and the factors require three tiles, and there are three placeholder tiles, one for each subtask. In total, this translates to 8·b²·(6T_r + 2u) bytes of minimum required memory for the entire process, where T_r is the number of threads.
For example, for an SVD of a 128GB matrix with 36 threads and b = 1024, the memory needed for the first phase is no more than 4GB. For the second phase, the memory used by ARPACK is no more than 8·(2NK + O(K²)) bytes [19], where K is the number of requested singular values and 8 is the number of bytes in a double precision number. The memory used to store the sparse matrix itself is less than 3·8·N·b bytes, where N·b is approximately the number of non-zero elements in the band matrix and 3 comes from the usage of three vectors: the row index, the column index, and the data. For the same matrix as above and 128 singular values, this amounts to approximately 3.3GB.
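These estimates are easy to reproduce (a sketch; the O(K²) ARPACK term is dropped, and N is taken as the side of a 128GB double-precision square matrix):

```python
def phase1_bytes(N, b, T_r):
    """First phase: 8*b^2*(6*T_r + 2u) bytes, with u = N/b tiles per
    dimension (shared tiles plus six tiles per thread)."""
    u = N // b
    return 8 * b**2 * (6 * T_r + 2 * u)

def phase2_bytes(N, b, K):
    """Second phase: ARPACK workspace 8*2*N*K bytes (O(K^2) term
    dropped) plus the band matrix in triplet form, 3*8*N*b bytes."""
    return 8 * 2 * N * K + 3 * 8 * N * b

N = 126_491   # side of a 128GB double-precision square matrix
print(phase1_bytes(N, b=1024, T_r=36) / 1e9)   # ~3.9 GB, under 4GB
print(phase2_bytes(N, b=1024, K=128) / 1e9)    # ~3.4 GB
```
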

F. TILE SIZE SELECTION
In this section, we develop a heuristic to select an effective tile size that results in good performance in the first phase of the algorithm. We first look to reduce the thread idle time. Consider the i-th iteration of Algorithm 1. The QR factorization of tile A[i, i] (line 2 of Algorithm 1) has to be completed before all other steps, since all of these steps depend on the value of Q. After that, however, multiple independent tasks can be done in parallel. Specifically, tile A[i + 1, i] can be factorized and, independently of that factorization, all tiles A[i, k] for i + 1 ≤ k ≤ u can be updated sequentially with a single thread. Hence, a good tile size would make the time to update the i-th row with one thread roughly equal to the time to factorize tiles A[i + 1, i] to A[i + T_r, i], as this would allow all threads to work in parallel, each updating its own row.
Let us denote the factorization time and the update time of a tile (essentially just a matrix-matrix multiplication) by t_f(b) and t_m(b), respectively. Since u = N/b, the above condition can be written as u·t_m(b) ≈ T_r·t_f(b), that is, b ≈ (t_m(b)/t_f(b))·N/T_r. For a single iteration of the algorithm, we heuristically find that the ratio t_m(b)/t_f(b) can be approximated by a constant γ with 1/2 < γ < 1. In further iterations of the algorithm, however, the number of tiles that need to be updated per row decreases from u as the trailing matrix gets smaller. In practice, we find that setting γ = 1/4 gives good results for the full algorithm.
Next, we look at the communication time. The condition for hiding communication is given in [17] as b ≥ 3.2α/β, where α is the computational performance of the machine (in flop/s) and β is the communication bandwidth (in bytes/s). For matrices larger than the memory, we can therefore choose the tile size as b ≈ max(N/(4T_r), 3.2α/β). For matrices where most of the matrix fits in memory, we can choose b ≈ N/(4T_r).
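The combined rule can be packaged as a small helper (a sketch; the α and β values below are illustrative machine parameters, and the rounding to a divisor of N follows the experimental setup in Section V):

```python
def pick_tile_size(N, T_r, alpha, beta, fits_in_memory=False):
    """Tile-size heuristic: b = N/(4*T_r) balances thread idle time,
    and b >= 3.2*alpha/beta hides disk communication. For matrices
    larger than memory, take the larger of the two; round to a divisor
    of N so the matrix tiles evenly."""
    target = N / (4 * T_r)
    if not fits_in_memory:
        target = max(target, 3.2 * alpha / beta)
    divisors = [d for d in range(1, N + 1) if N % d == 0]
    return min(divisors, key=lambda d: abs(d - target))

# e.g. a 1.2 TF/s node with a 1 GB/s file system needs b >= 3840 to
# hide communication with one I/O thread; the helper rounds this to
# the nearest divisor of N.
b = pick_tile_size(N=110_880, T_r=36, alpha=1.2e12, beta=1e9)
```
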

V. EXPERIMENTAL RESULTS
We ran most of our experiments on a node with two 18-core Intel Broadwell E5-2695V4 processors running at a 2.1 GHz clock speed, with a 45MB shared cache, 128GB of RAM, and a peak performance of 1.2 TF/s. This node is connected to a parallel Lustre file system. We ran an additional experiment on an Intel Broadwell E5-2650V4 with 128 GB of RAM and 24 hyper-threads.

A. I/O OVERLAP
We ran our algorithm on three matrix sizes, N = 80000, N = 100000, and N = 120000, with 128 singular values, a 500 × 500 tile size, and a varying number of computation and I/O threads. Together with the reflectors, these matrices are too large to fit in memory. Despite removing computation threads to add I/O threads, the computation time does not noticeably increase. We do not achieve complete overlap; however, we get large gains in I/O speed. From one to two I/O threads, the non-overlapped I/O time is reduced by more than half. Parallel I/O reduces the optimal tile size needed to overlap computation with communication with up to four I/O threads by increasing the I/O bandwidth. With more than four I/O threads, we get more varied results. As shown in Figure 8(a), having more I/O threads reduces communication time consistently, but for (b) and (c) we get either no noticeable reduction or, in the case of (b), an increase in communication time. Overall, however, for up to four I/O threads, this gives us more options in choosing a tile size that reduces the global execution time. The equation given in [17] for the optimal tile size to overlap computation with communication then becomes b ≥ 3.2α/(T_IO β) on parallel file systems, where β is the communication bandwidth for serial I/O.

B. TILE SELECTION
We analyze the effects of the tile size on the execution time of our code. We run our algorithm on square matrices of sizes N = 20160 and N = 45360 that fit in memory, each run with one I/O thread. We also run our code on matrices of sizes N = 83160 and N = 110880 that do not fit in memory, run with two I/O threads to lessen the non-overlapped communication. The elements of the matrices are generated from a uniform random distribution, and we compute the top 128 singular vectors and values. Note that, since the matrices are uniform random, there exists one very large singular value, followed by tightly clustered singular values. Since the implicitly restarted Lanczos method used by ARPACK tends to under-perform for such matrices, the execution time of the second phase of our method is an upper bound on what can be expected for most real-world data. Figure 9 shows that, as expected, the second phase performs best for small tile sizes and its execution time increases as the tile size increases. We observe that, for all the matrices, relatively small tile sizes perform the worst for the first phase of the algorithm. The first-phase time then stabilizes, but continues to increase for large tile sizes. The reason larger tile sizes are less effective is that they limit the amount of parallelism: if u < T_r, then at most u threads can work in parallel.
In Table 1, we compare the execution times of runs using a tile size selected with our tile selection method and the tile size with the minimum idle time. The tile selection is modified so that only sizes b that are factors of N are chosen: we set b to the factor of N closest to max(N/(4T_r), 3.2α/β). We find that, as a fraction of the total time spent idling in phase 1, the estimated tile size is either close to the best tile size or its fraction of time spent idling is close to the best idling time, with the exception of N = 20160. With smaller N, we notice the same issue; however, the rule holds for all larger matrices that we tried.
One final observation is that, as the matrix size increases, the running time of the algorithm becomes increasingly dominated by the first-phase running time. In Table 2, we show that, for most tile sizes that are optimal for phase 1, the global execution time is close to the optimal global execution time. This is because the execution time of the first phase grows faster than the execution time of the second phase when we fix the number of singular values, K.

C. STRONG SCALING
We analyze the effect of the number of threads on the performance of the algorithm. We run the tSVD on a 32768 × 32768 matrix with 1, 2, 4, 8, 16, and 32 computation threads, 1 I/O thread, and tile sizes of 128, 256, 512, and 1024. Figure 10 shows that, when the tile size is 256, which is near the proposed optimum for 32 threads, we achieve 96% scaling efficiency with 32 threads; with all other sizes, this reduces to approximately 50%. We also note that, with 4 threads, we achieve higher per-thread performance than when using only a single thread. This happens because we choose the internal LAPACK tile size at run time, and this choice can affect the execution time.
ERIK SKAU received the B.Sc. degree in applied mathematics and physics, and the M.Sc. and Ph.D. degrees in applied mathematics from North Carolina State University, Raleigh, NC, USA. He is currently a Postdoctoral Research Associate with the Los Alamos National Laboratory (LANL), Information Sciences (CCS-3) Group. His research and technical interests include signal processing, matrix and tensor factorization, and machine learning.
GOPINATH CHENNUPATI received the Ph.D. degree from the University of Limerick, Ireland. He is currently a Computer Scientist with the Los Alamos National Laboratory (LANL), Information Sciences (CCS-3) Group. He works on high performance computing (HPC), performance modeling, natural language processing (NLP), deep/machine learning, high performance linear algebra, and so on.
BOIAN ALEXANDROV (Member, IEEE) received the M.Sc. degree in theoretical physics, and the first Ph.D. degree in nuclear engineering and the second Ph.D. degree in computational biophysics. He is currently a Scientist with the Los Alamos National Laboratory, Theoretical Division (T-1) Group. He is specialized in big data analytics, matrix and tensor factorization, unsupervised learning, and latent feature extraction.
HRISTO DJIDJEV (Member, IEEE) received the M.Sc. degree in applied mathematics and the Ph.D. degree in computer science from Sofia University, Bulgaria. He is currently a Computer Scientist with the Los Alamos National Laboratory (LANL), Information Sciences (CCS-3) Group. Before joining LANL as a Scientist, he worked as an Assistant Professor with Rice University and a Senior Lecturer with Warwick University. He is also a Research Adjunct Professor with Carleton University, Ottawa, ON, Canada. VOLUME 8, 2020