More Efficient Secure Matrix Multiplication for Unbalanced Recommender Systems

With recent advances in homomorphic encryption (HE), it becomes feasible to run non-interactive machine learning (ML) algorithms on encrypted data without decryption. In this work, we propose novel encoding methods to pack matrices compactly and more efficient methods to perform matrix multiplication on homomorphically encrypted data, leading to a speed boost of $1.5\times$ to $20\times$ for slim rectangular matrix multiplication compared with the state of the art. Moreover, we integrate our optimized secure matrix arithmetic with the MPI distributed computing framework, achieving scalable parallel secure matrix computation. Equipped with the optimized matrix multiplication, we propose uSCORE, a privacy-preserving cross-domain recommendation system for the unbalanced scenario, where a big data owner provides recommendation as a service to a client who has less data and computation power. Our design delegates most of the computation to the service provider and has a low communication cost, which previous works failed to achieve. For a client who has 16 million user-item pairs to update, it takes only about 3 minutes (in the LAN setting) to prepare the encrypted data. The server can finish the update process on the encrypted data in less than half an hour, effectively reducing the client's test error from 0.72 to 0.62.


INTRODUCTION
Data sharing has become a double-edged sword in the era of big data. As different data may complement each other, learning on combined data from multiple domains often yields better results than learning on a single source. This is especially the case for a small data owner in an unbalanced scenario. For example, a start-up vendor (client) may want to buy services from a big vendor (server) in order to make better recommendations for its customers. However, in such a collaboration scenario, direct sharing of the original data is undesirable, not only because the data is a valuable resource for its owner, but also because it might contain sensitive user information. To address this issue, several secure computation techniques have emerged, including HE and secure multiparty computation (MPC). For example, a line of works [1], [2], [3] uses HE to evaluate neural network models, where model parameters or data are encrypted. Frameworks such as SecureML [4], ABY3 [5], and SPDZ2k [6] can train machine learning models collaboratively, using MPC techniques such as garbled circuits and secret sharing. Generally speaking, MPC approaches support richer functions than HE, but impose almost the same computation and communication costs on each party, and thus might not be optimal for the unbalanced scenario.
In this work, we focus on the optimization of HE-based matrix computation, which is a crucial part of most ML algorithms. Efficient matrix addition and element-wise multiplication (i.e., the Hadamard product) are fairly straightforward, because most HE schemes support "packing," which allows a certain number of values to be packed into a single ciphertext; any subsequent operation on the ciphertext is broadcast to its inner values in a Single-Instruction-Multiple-Data (SIMD) vectorized manner. Secure matrix multiplication is more difficult to design because it involves both element-wise operations and cross-element aggregation. Recently, Jiang et al. [3] proposed a highly efficient secure matrix multiplication method that decomposes matrix multiplication into Hadamard products and linear transformations involving only rotations, constant multiplications, and element-wise additions. However, their method is optimized mainly for square matrices, and is not optimal for rectangular matrix multiplication in terms of either space or time, especially when the rectangular matrix is slim (e.g., many more rows than columns). As a result, some popular ML algorithms (e.g., Singular Value Decomposition (SVD)) that contain slim matrix operations cannot fully benefit from their method.
We propose optimized secure matrix multiplication methods that come in handy in the design of a secure unbalanced recommender system. Recommender systems predict the preferences of users from existing "(user, item)" preference pairs, and have already been successfully applied in many popular applications (e.g., Netflix, YouTube, e-commerce), whereas cross-domain recommender systems require data or knowledge transfer among different data owners. Many cross-domain recommendation systems fit the "unbalanced" scenario. These systems have inherent privacy problems, which prior works [7] acknowledge but fail to address with satisfactory privacy guarantees. As we will show, a major difficulty in designing algorithms for such a system is how to securely and efficiently solve an optimization problem that demands heavy matrix computation.

Our Contributions
We propose novel methods to pack data in a compact way and more efficient algorithms to perform rectangular matrix multiplication, which was not well addressed in Jiang et al. [3]. We provide a detailed complexity analysis to show that our method is asymptotically better in both space and time. Experiments show that we achieve a $1.5\times$ to $20\times$ speedup for different rectangular matrix multiplications under various parameters (summarized in Table 1).
We integrate our optimized matrix computation with the MPI distributed computing framework, enabling large-scale secure matrix computation. By carefully splitting the matrix and distributing work among the MPI nodes, we achieve almost optimal scalability. For example, the time cost of multiplication between a $16\text{K} \times 16\text{K}$ encrypted matrix and a $16\text{K} \times 16$ encrypted matrix with 256 MPI nodes is roughly the same as that between a $1\text{K} \times 1\text{K}$ encrypted matrix and a $1\text{K} \times 16$ encrypted matrix with a single node.
We propose uSCORE, an unbalanced Secure COllaborative REcommendation system based on secure sparse SVD optimization, using our matrix computation framework. The system greatly boosts the recommendation accuracy for the client, and the computation cost lies mainly on the server side, making it well suited for the unbalanced scenario. We conduct extensive experiments on real-world datasets in various settings, showing that uSCORE achieves higher recommendation accuracy than prior works and is orders of magnitude faster than general state-of-the-art MPC solutions. For example, secure training on hundreds of millions of encrypted preference pairs takes the server less than half an hour and costs the client only about 3 minutes in the LAN setting (in contrast, it would cost 68 hours for both parties using MPC), and reduces the client's mean absolute error (MAE) from 0.72 to 0.62.
RELATED WORK

Privacy-Preserving Recommender Systems. A pure MPC solution based on garbled circuits has been proposed to factorize matrices and generate recommendations [9], whereas CryptoRec [10] suggests using homomorphic encryption to host a service for privacy-preserving recommendation. Some works ([11], [12]) reveal intermediate information (e.g., item or user similarities) that they argue to be acceptable. Shmueli and Tassa [11] propose a simple neighborhood solution that does not require iterative training but includes a third party who learns non-trivial intermediate information.
There are also works [13], [14] that add noise to the algorithms in order to preserve users' privacy, which has been shown to degrade accuracy significantly [11]. The formal framework of differential privacy has also been investigated to produce private recommender systems that resist inference attacks [15], [16]. Some architectures ([17]) are based on the federated learning framework, whose privacy guarantees are under active research and debate [18].
Machine Learning Based on MPC and HE. There have been numerous efforts to solve machine learning problems with secure computing techniques. To securely evaluate neural networks, CryptoNets [1] and E2DM [3] are frameworks based solely on homomorphic encryption, whereas MiniONN [19], Gazelle [20], and DeepSecure [21] employ multiparty computation techniques such as garbled circuits that are good at evaluating non-linear functions. All these frameworks target only the prediction phase with trained machine learning models, while secure training remains a much more challenging problem due to the depth of computation. SecureML [4] introduces mixed protocols for training several machine learning models and provides the first privacy-preserving system for training neural networks. Owing to the improved efficiency of homomorphic encryption in the past years and its simple outsourced cloud application model, a line of works discusses inspiring methods for training machine learning models on encrypted data [22], [23], [24].
Secure matrix multiplication is a trending research topic due to its importance in secure computing applications, especially in machine learning. Wu and Haven [25] propose methods to perform a secure inner product between two encrypted vectors, from which they construct secure matrix multiplication and apply it to statistical analysis that scales to encrypted datasets with four million elements. Some researchers also explore various packing methods to enable more efficient secure matrix multiplication [26], [27], [28]. However, these methods can compute only one multiplication between two matrices; in order to enable iterative matrix products, intermediate results have to be decrypted and re-encrypted. Halevi and Shoup [29] introduce the diagonal decomposition that enables efficient matrix-vector multiplication in ciphertext, and it has inspired several important follow-up works [3], [20], including our block extension in this work. Juvekar et al. [20] propose faster matrix-vector multiplications and convolutions by using a simpler packed additively homomorphic encryption scheme that provides only homomorphic additions and scalar multiplications; their scheme offers excellent performance for scenarios where homomorphic multiplication of two ciphertexts is not needed, e.g., in their CNN inference system where party A holds plaintext network parameters and party B holds encrypted input images. Jiang et al. [3] propose by far the most efficient non-interactive secure matrix multiplication algorithms, which work well when both matrices are encrypted and enable iterative multiplication because the multiplication result maintains the same encoding format.

PRELIMINARIES
We let $\odot$ denote the element-wise product between vectors or matrices, $\langle\cdot,\cdot\rangle$ denote the inner product between vectors, $\|\cdot\|$ denote the $\ell_2$-norm, and boldfaced $\mathbf{1}$ denote a vector of all 1s. We list the detailed basics of homomorphic encryption in the supplemental material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2021.3139318, and only mention necessary terminology here. Let Add, Mult, CMult, and Rot denote homomorphic element-wise addition, multiplication, constant multiplication, and slot rotation, respectively. In the recent CKKS [30] scheme, rescaling is another important operation to mitigate the bit-length explosion problem of fixed-point plaintexts.
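To make the slot semantics concrete, the following minimal Python sketch models a ciphertext as a plaintext slot vector. The names `add`, `mult`, `cmult`, and `rot` are our illustrative stand-ins for the homomorphic Add, Mult, CMult, and Rot; they are not SEAL API calls, and the sketch deliberately ignores noise and rescaling.

```python
# Plaintext model of SIMD slot semantics (illustrative only; real
# ciphertexts would be CKKS objects and rescaling would apply).

def add(a, b):            # Add: slot-wise addition
    return [x + y for x, y in zip(a, b)]

def mult(a, b):           # Mult: slot-wise (Hadamard) multiplication
    return [x * y for x, y in zip(a, b)]

def cmult(a, c):          # CMult: multiplication by a plaintext constant vector
    return [x * y for x, y in zip(a, c)]

def rot(a, l):            # Rot: left rotation of the slots by l steps
    n = len(a)
    return [a[(i + l) % n] for i in range(n)]

# Example: two "ciphertexts" with 4 slots each.
a, b = [1, 2, 3, 4], [10, 20, 30, 40]
assert add(a, b) == [11, 22, 33, 44]
assert mult(a, b) == [10, 40, 90, 160]
assert rot(a, 1) == [2, 3, 4, 1]
```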

Secure Matrix Multiplication
Jiang et al. propose efficient square matrix multiplication using the packing technique in homomorphic encryption, and they suggest extending it to large matrices by using block-wise operations [3]. This is by far the most efficient HE-based matrix multiplication algorithm that enables iterative multiplication.
In their basic scheme, they can perform the square matrix multiplication $\mathbb{R}^{d\times d} \times \mathbb{R}^{d\times d}$. For a square matrix $A \in \mathbb{R}^{d\times d}$ and a vector $a \in \mathbb{R}^{d^2}$, let $\iota: \mathbb{R}^{d^2} \to \mathbb{R}^{d\times d}$ denote the row encoding map

$\iota(a) = A, \quad \text{where } A_{i,j} = a_{d \cdot i + j}. \qquad (1)$

For two square matrices $A, B$, whose encodings are $a, b$, their matrix multiplication can be formulated as

$\iota^{-1}(A \cdot B) = \sum_{i=0}^{d-1} (V_i \cdot U_\sigma \cdot a) \odot (W_i \cdot U_\tau \cdot b), \qquad (2)$

where $U_\sigma, U_\tau, V_i, W_i \in \mathbb{R}^{d^2 \times d^2}$ are sparse linear transformation matrices that apply to the vectors $a$ and $b$; $V_0$ and $W_0$ are the identity matrix. To perform a linear transformation $U \cdot v$, they use the method from Halevi and Shoup [29]: let $u_\ell = (U_{0,\ell}, U_{1,\ell+1}, \ldots, U_{d^2-\ell-1,d^2-1}, U_{d^2-\ell,0}, \ldots, U_{d^2-1,\ell-1})$ denote the $\ell$th diagonal vector of $U$; then we have

$U \cdot v = \sum_{\ell=0}^{d^2-1} u_\ell \odot \rho(v, \ell), \qquad (3)$

where $\rho(v, \ell)$ denotes the left rotation of $v$ by $\ell$ steps. Fig. 1 gives an example of the various components used in the above square matrix multiplication.
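As a sanity check of Equation (3), the sketch below evaluates $U \cdot v$ in plaintext using only diagonal extraction, rotations, and slot-wise products, the same three primitives available homomorphically. This is a plaintext simulation with our own helper names, not the encrypted implementation.

```python
# Plaintext simulation of the Halevi-Shoup diagonal method (Equation (3)):
# U @ v is computed from the diagonals u_l of U and left rotations of v.

def rot(v, l):
    n = len(v)
    return [v[(i + l) % n] for i in range(n)]

def diagonal(U, l):
    # l-th generalized diagonal: u_l[i] = U[i][(i + l) % n]
    n = len(U)
    return [U[i][(i + l) % n] for i in range(n)]

def diag_matvec(U, v):
    n = len(v)
    out = [0] * n
    for l in range(n):
        u_l = diagonal(U, l)
        r = rot(v, l)
        out = [o + u * x for o, u, x in zip(out, u_l, r)]
    return out

U = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
v = [1, 0, 2]
assert diag_matvec(U, v) == [sum(U[i][j] * v[j] for j in range(3)) for i in range(3)]
```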

Rectangular Matrix Multiplication
To handle the rectangular matrix multiplication $\mathbb{R}^{l\times d} \times \mathbb{R}^{d\times d}$, Jiang et al. stack $(d/l)$ copies of the left matrix in the vertical direction so that it also becomes $\mathbb{R}^{d\times d}$. Afterwards, they propose a more efficient algorithm that requires only $O(l)$ Mult and linear transformations, instead of the $O(d)$ in the above square matrix multiplication. We observe that this stacking method is not optimal in all cases because it might waste a lot of packing slots. With packing size $s = ld$, the stacked matrix must be handled as square blocks of dimension $\sqrt{ld} \times \sqrt{ld}$, i.e., the left matrix is stacked into $\mathbb{R}^{\sqrt{ld}\times d}$, which can also be treated as a block matrix with $\sqrt{d/l}$ blocks. This actually wastes packing slots, which is inefficient when $l$ is small and $d$ is large. Intuitively, the optimal way would be to pack $\mathbb{R}^{l\times d}$ into a single ciphertext, instead of $\sqrt{d/l}$ ones. Part of our contribution is the proposal of a more compact encoding method that enables us to pack $\mathbb{R}^{l\times d}$ in exactly one ciphertext and to perform the rectangular matrix multiplication in a novel and more efficient way. Moreover, it is unclear how their rectangular matrix multiplication could be extended to $\mathbb{R}^{d\times l} \times \mathbb{R}^{l\times d}$. One possible solution is stacking $A$ horizontally and $B$ vertically so that both become square matrices $\bar{A}$ and $\bar{B}$, and computing $AB = (l/d) \cdot \bar{A} \cdot \bar{B}$. However, this solution introduces one extra constant multiplication depth.

Parallel Matrix Multiplication
Jiang et al. also extend their method to perform parallel matrix multiplication for $g$ pairs of square matrices simultaneously, which we use extensively in this work. Let $\iota_g$ denote the 1-to-1 map between $\mathbb{R}^s$ and $(\mathbb{R}^{d\times d})^g$, where $s = d^2 g$. For a vector $a = (a_i)_{0 \le i < s}$, the parallel encoding is defined as

$\iota_g(a) = (A_0, \ldots, A_{g-1}), \quad \text{where } (A_k)_{i,j} = a_{g \cdot (d \cdot i + j) + k}. \qquad (4)$

The elements of $a$ with indexes congruent to $k$ modulo $g$ correspond to the $k$th matrix $A_k$. This is also illustrated later in Fig. 4a. We can perform a parallel matrix-wise $\ell$-step rotation by $\rho(a, g \cdot \ell)$.
For two parallel encodings $a$ and $b$, to perform matrix multiplication between $g$ pairs of matrices, we need to first encode $g$ copies of $U_\sigma, U_\tau, (V_i)_{1 \le i < d}, (W_i)_{1 \le i < d}$, by encoding $g$ copies of each of their diagonal vectors. In other words, for $0 \le \ell < d^2$, the copy parallel encoding $\ddot{u}_\ell = (\ddot{u}_i)_{0 \le i < s}$ of the diagonal vector $u_\ell$ is defined as

$\ddot{u}_{g \cdot j + k} = (u_\ell)_j, \quad 0 \le j < d^2, \; 0 \le k < g, \qquad (5)$

i.e., $\ddot{u}_\ell$ is the encoding of $g$ copies of $u_\ell$. For example, when a diagonal vector is $(1, 0, 1, 0)$ and $g = 2$, its copy parallel encoding is $(1, 1, 0, 0, 1, 1, 0, 0)$. We denote the corresponding copy matrices as $\ddot{U}_\sigma, \ddot{U}_\tau, \ddot{V}_i, \ddot{W}_i$. Next, to perform parallel matrix multiplication, we use a formula similar to Equation (2), except that we use the copy matrices, and use $\rho(\cdot, g \cdot \ell)$ instead of $\rho(\cdot, \ell)$ to perform the $\ell$-step rotation in the linear transformation. Note that when $g = 1$, this is exactly the same as square matrix multiplication. We summarize the previous work on parallel matrix multiplication in Algorithm 1, whose input is the (encrypted) $g$ pairs of $d \times d$ square matrices $\{A_0, \ldots, A_{g-1}\}$ and $\{B_0, \ldots, B_{g-1}\}$, and whose output is the (encrypted) $g$ products $\{A_0 B_0, \ldots, A_{g-1} B_{g-1}\}$.
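A minimal sketch of this interleaved layout: element $a_{g(d\cdot i + j)+k}$ is slot $(i,j)$ of matrix $A_k$, so a single rotation by $g \cdot \ell$ rotates every packed matrix by $\ell$ steps simultaneously. This is a plaintext-only illustration; the helper names are ours.

```python
# Plaintext sketch of the parallel encoding iota_g: g matrices of size
# d x d are interleaved into one slot vector of length s = d*d*g, so that
# slots with index k (mod g) form matrix A_k.

def parallel_encode(mats, d, g):
    a = [0] * (d * d * g)
    for k, A in enumerate(mats):
        for i in range(d):
            for j in range(d):
                a[g * (d * i + j) + k] = A[i][j]
    return a

def rot(v, l):
    n = len(v)
    return [v[(i + l) % n] for i in range(n)]

d, g = 2, 3
mats = [[[10 * k + 2 * i + j for j in range(d)] for i in range(d)] for k in range(g)]
a = parallel_encode(mats, d, g)

# A rotation by g*l applies an l-step rotation to each packed matrix at once:
r = rot(a, g * 1)
rotated = [rot([A[i][j] for i in range(d) for j in range(d)], 1) for A in mats]
assert r == parallel_encode(
    [[[m[d * i + j] for j in range(d)] for i in range(d)] for m in rotated], d, g)
```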

Collaborative Filtering
We let $R \in S^{n\times m}$ denote the observed rating matrix, where $n$ is the number of users and $m$ is the number of items, e.g., $S = \{?, 1, 2, 3, 4, 5\}$, with '?' indicating a missing value. $r_{ui} \in S$ denotes the rating given by user $u$ to item $i$. We use another matrix $Y \in \{0, 1\}^{n\times m}$ to denote the corresponding indicator matrix, with $y_{ui} = 0$ if $r_{ui} = ?$ (user $u$ has not rated item $i$), and $y_{ui} = 1$ otherwise. We use the bold letters $\mathbf{r}_{u*}$ and $\mathbf{r}_{*i}$ ($\mathbf{y}_{u*}$ and $\mathbf{y}_{*i}$, respectively) to denote the $u$th row and $i$th column of $R$ ($Y$, respectively). Normally, the rating matrix is sparse; for example, in the MovieLens 100K dataset [31], the density is 6.3%. In collaborative filtering, two categories of models are popular in the literature: neighborhood models and latent factor models, which we briefly introduce below.
In neighborhood models, we compute the pairwise similarities between items (item-based) or users (user-based). Cosine similarity and Pearson correlation are two commonly used similarity metrics. A missing value $r_{ui}$ is predicted as a weighted average over item $i$'s neighborhood (item-based) or user $u$'s neighborhood (user-based).
In latent factor models, for each user $u$, we extract a user-factors vector $p_u \in \mathbb{R}^f$; similarly, we extract an item-factors vector $q_i \in \mathbb{R}^f$ for each item $i$ ($f$ is usually quite small, e.g., $< 20$). A missing value $r_{ui}$ is predicted as $\hat{r}_{ui} = b_{ui} + p_u^T q_i$, where $b_{ui}$ is some baseline estimate that we will explain later. The model works by discovering the intrinsic correlation between a user and an item in some subspace. Latent factor models have become popular thanks to their good accuracy and scalability.

System and Threat Models
We give an overview of our system in Fig. 2. We are interested in the scenario where the rating matrix $R$ is vertically split (split on items) and held by two parties A and B. We use $m_A$ and $m_B$ to denote the number of items held by A and B, respectively ($m = m_A + m_B$). Note that the same method could be used for the symmetric scenario where the matrix is horizontally split (split on users). The setting is generally known as the cross-domain recommender system [7]. Two dataset owners can benefit from each other by sharing knowledge learnt from their own datasets [32].

Fig. 2. System overview. Both the server and the client are assumed to be semi-honest. The first step involves garbled-circuit computation in order to generate a baseline estimate that requires non-polynomial operations such as division; its complexity is $O(n)$. All following steps are computed with somewhat homomorphic encryption. The complexity of steps 2 and 4 is $O(n m_B)$.
We focus on the unbalanced scenario where one party (party B) has limited computation power and few ratings (widely known as the data-sparsity or cold-start problem [33]), and wants to improve the recommendation accuracy for its customers with the help of a service provider (party A) who has substantially more ratings and a powerful computation infrastructure. A pure MPC-based solution (e.g., garbled circuits or secret sharing) might not be an optimal choice due to its requirement of symmetric computation and communication power; instead, we propose a mixed protocol mainly using HE, with MPC for the division operation (which is not supported in HE) during the initialization.
We assume a semi-honest security model, where each party follows the protocol but might try to infer more information. In this model, we design our system so that no intermediate information is revealed to any party, except the final recommendation results. Note that this provides stronger privacy than some state-of-the-art proposals that additionally reveal intermediate information, such as [11], which leaks the similarity matrix, or the federated learning frameworks [17], [34], which leak gradients. In our system, party B, which owns the sparser dataset, also holds the secret key for the HE scheme and can decrypt the final recommendation results sent by party A.

OPTIMIZED SECURE MATRIX MULTIPLICATION
As we will see in Section 5, our secure collaborative filtering requires iterative rectangular matrix multiplication. We discuss some problems with using the previous scheme in practice, and propose a more general and efficient scheme for matrix multiplication. We believe it will also be of general interest for many other machine learning tasks based on iterative optimization algorithms.

Diagonal Block Matrix Multiplication
The problem with the previous scheme is partially due to the row-encoding method, which requires basing the multiplication on square matrices: rectangular matrices are stacked to form square matrices, and parallel-encoded matrices only support square-matrix-wise operations and cannot be treated as a single slim rectangular matrix. Next, we propose a scheme that enables more efficient rectangular matrix multiplication.

Matrix Sharding and Diagonal Block Encoding
Without loss of generality, let us consider a slim rectangular matrix $B \in \mathbb{R}^{(dg)\times d}$ ($dg$ rows and $d$ columns) and packing size $s = d^2 g$, meaning it can be packed and encrypted into a single ciphertext. Now let us consider another matrix $A \in \mathbb{R}^{(dg)\times(dg)}$, and the matrix multiplication $A \cdot B$. Using the previous scheme, this can be a rather tricky problem to solve when $g > 1$, because we have to stack $B$ horizontally to create multiple sub square matrices (thus using multiple ciphertexts, Fig. 3a), which might not even be possible when $s$ is not a perfect square number.
We view $A$ as a $g \times g$ block matrix and $B$ as a length-$g$ block column vector, with $d \times d$ square blocks as units. Extending the diagonal idea to blocks, let $A^{(\ell)}$ denote the $\ell$th block diagonal of $A$, i.e., the block vector whose $k$th block is $A^{(\ell)}_k = A_{k, (k+\ell) \bmod g}$; similarly, $B_k$ denotes the $k$th block of $B$. In addition, let $a^{(\ell)}$ and $b$ denote the parallel encodings of $A^{(\ell)}$ and $B$, respectively. We can then express $A \cdot B$ by combining inter-block rotation and block vector multiplication. Let $B^{(g,\ell)} = \gamma(B, \ell)$ denote the result of the $\ell$-step left rotation of the blocks in $B$, and let $A^{(\ell)} \circledast B$ denote the block-wise product of two block vectors, i.e., $(A^{(\ell)}_k \cdot B_k)_{0 \le k < g}$, which can be computed by the parallel matrix multiplication. Therefore, we have

$A \cdot B = \sum_{\ell=0}^{g-1} A^{(\ell)} \circledast B^{(g,\ell)}. \qquad (6)$

We show a simplified comparison with the previous method in Fig. 3. Note that our block extension of the "diagonal" idea is independent of that in Equation (3), which is still necessary for performing $A^{(\ell)} \circledast B^{(g,\ell)}$ in Equation (6). The remaining question is how to obtain the parallel encoding of $B^{(g,\ell)}$ homomorphically. Let $b^{(r,\ell)} = \rho(b, \ell)$, and let $b_k$ denote the row encoding of $B_k$. We omit the details and provide the following result: the $k$th block encoded in $b^{(r,\ell)}$ carries the row encoding of $B_{(k+\ell) \bmod g}$ for $0 \le k < g - \ell$, and that row encoding rotated left by one slot for $g - \ell \le k < g$. Therefore, $\rho(b, \ell)$ actually represents an incomplete inter-block rotation: $g - \ell$ blocks have the correct row encoding after the rotation, whereas $\ell$ blocks have their row encoding shifted to the left by 1. This is illustrated in Fig. 4a for an example with $\ell = 1$.
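The following plaintext sketch checks Equation (6) on small random blocks: the $\ell$th block diagonal of $A$ multiplies the $\ell$-step block rotation of $B$, and the block-wise products are summed. This is our own illustrative code operating on unencrypted blocks, not the homomorphic algorithm itself.

```python
# Plaintext check of the diagonal block decomposition (Equation (6)):
# A*B = sum_l A^(l) (x) gamma(B, l), with d x d blocks as units.
import random

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, g = 2, 3
A = [[random.randint(0, 5) for _ in range(d * g)] for _ in range(d * g)]
B = [[random.randint(0, 5) for _ in range(d)] for _ in range(d * g)]

def block_A(k, l):   # block (k, (k+l) mod g) of A, i.e., A^(l)_k
    r, c = k * d, ((k + l) % g) * d
    return [row[c:c + d] for row in A[r:r + d]]

def block_B(k):      # k-th d x d block of B
    return [row[:] for row in B[k * d:(k + 1) * d]]

acc = [[[0] * d for _ in range(d)] for _ in range(g)]
for l in range(g):
    for k in range(g):
        # gamma(B, l) rotates blocks: its k-th block is B_{(k+l) mod g}
        acc[k] = matadd(acc[k], matmul(block_A(k, l), block_B((k + l) % g)))

expected = matmul(A, B)
assert all(acc[k] == expected[k * d:(k + 1) * d] for k in range(g))
```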
Basic Solution With a Repair Matrix. The problem becomes how to repair the incomplete rotation so that we obtain $B^{(g,\ell)}$ instead of $B^{(r,\ell)}$. To explain the solution more clearly, we consider only one "damaged" block encoding $b^{(r,\ell)}_k$ ($g - \ell \le k < g$). It turns out that a linear transformation $U_r \cdot b^{(r,\ell)}_k$ with the permutation matrix $U_r$ of the 1-step right rotation, i.e., $(U_r)_{i,j} = 1$ if and only if $i = (j + 1) \bmod d^2$, rotates the encoding back to the right by 1 step. We call $U_r$ the repair matrix.
To use the parallel operation, one still needs to perform a dummy linear transformation $U_e \cdot b^{(r,\ell)}_k$ for $0 \le k < g - \ell$. We call it "dummy" because these blocks are not damaged and require no rotation; hence they are paired with the identity matrix $U_e$. Therefore, to obtain a complete inter-block rotation, the basic solution is to encode $(g - \ell)$ copies of $U_e$'s diagonal vector and $\ell$ copies of $U_r$'s diagonal vector. In Fig. 4b, we show how $U_r$ and $U_e$ can be used to repair the incomplete rotation. Note that the computation shown on the right is an application of the parallel linear transformation.

Fig. 3. A simplified comparison for rectangular matrix multiplication. The '$\times$' symbol always refers to matrix multiplication. Each unit is a square matrix in $\mathbb{R}^{d\times d}$. This example shows the matrix multiplication $\mathbb{R}^{4d\times 4d} \times \mathbb{R}^{4d\times d}$, and the packing size is $4d^2$. Jiang et al. [3] would stack the rectangular matrix so that the right matrix becomes $\mathbb{R}^{4d\times 2d}$, while our method employs the compact diagonal block encoding.

Fig. 4. Obtaining a complete inter-block rotation for $\ell = 1$. In (a), block $B_0$ is "damaged" after the rotation, which is fixed in (b) by using the repair matrix and adding another layer of constant multiplication. In (b), the non-zero diagonal vectors are highlighted, both in the transformation matrix and in the encoding.
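To see why the rotation is "incomplete," the sketch below rotates a parallel encoding by $\ell$ slots (not $g \cdot \ell$) and compares each recovered block against the expected block rotation; the last $\ell$ blocks come back shifted left by one slot, which is exactly the damage the repair matrix $U_r$ undoes. This is a plaintext-only illustration with our own names.

```python
# Plaintext illustration of the incomplete inter-block rotation: rotating
# the parallel encoding by l slots moves block k to B_{(k+l) mod g}, but the
# last l blocks arrive with their row encoding shifted left by one slot.

def rot(v, l):
    n = len(v)
    return [v[(i + l) % n] for i in range(n)]

d, g, l = 2, 4, 1
s = d * d * g
flats = [[100 * k + t for t in range(d * d)] for k in range(g)]  # row encodings b_k
a = [0] * s
for k in range(g):
    for t in range(d * d):
        a[g * t + k] = flats[k][t]                               # parallel encoding b

r = rot(a, l)                                                    # b^(r,l)
for k in range(g):
    got = [r[g * t + k] for t in range(d * d)]
    want = flats[(k + l) % g]
    if k < g - l:
        assert got == want                 # correct row encoding
    else:
        assert got == rot(want, 1)         # "damaged": shifted left by 1
```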

Mixed Copy Parallel Encoding
However, the basic solution has a drawback: the "repairing" linear transformation increases the multiplication depth by one extra plaintext multiplication, leading to a total depth of 1 Mult and 3 CMult for the diagonal block multiplication. We further optimize the solution so that it has the same multiplication depth as the original parallel multiplication, and even higher efficiency.
The idea is to combine the repair matrix with the copy parallel encoding matrix. We give the following result and provide a complete proof in the supplemental material, available online.
Theorem 1. Suppose $A \in \mathbb{R}^{(dg)\times(dg)}$ and $B \in \mathbb{R}^{(dg)\times d}$. Then $A \cdot B$ can be computed by a variant of the parallel matrix multiplication formula in which $\ddot{U}_\sigma$, $\ddot{V}_i$, and $\ddot{W}_i$ are the copy matrices, and $\ddot{U}_\tau$ is replaced by a mixed copy matrix $\hat{U}_\tau(\ell)$ whose $i$th diagonal vector $\hat{u}_\tau(i,\ell) = (\hat{u}_x)_{0 \le x < s}$ is defined by the mixed copy parallel encoding, mixing copies of $U_\tau$'s diagonals (for the $g - \ell$ undamaged blocks) with copies of the repaired diagonals (for the $\ell$ damaged blocks); the precise statement and proof are given in the supplemental material, available online. Moreover, for a given $\ell \in [0, g)$, there are only $2d$ non-zero $\hat{u}_\tau(i,\ell)$, namely $\hat{u}_\tau(dk,\ell)$ and $\hat{u}_\tau(dk-1,\ell)$, for $0 \le k < d$.

Therefore, we need two types of rotation: (1) for each $\ell \in [0, g)$, we perform $b^{(r,\ell)} = \rho(b, \ell)$ to obtain the incomplete block rotation; (2) in the linear transformation $\hat{U}_\tau(\ell) \cdot b^{(r,\ell)}$, for $i \in [0, d^2)$, we perform $\rho(b^{(r,\ell)}, g \cdot i)$ to obtain the block-wise rotation. We summarize this procedure, named diagonal block matrix multiplication, in Algorithm 2.

The same encoding method can be devised to perform $\mathbb{R}^{d\times(dg)} \times \mathbb{R}^{(dg)\times(dg)}$, except that in each iteration we should rotate both the blocks of the left matrix and the blocks of the diagonal block vector of the right matrix. Let $A \in \mathbb{R}^{d\times(dg)}$ and $B \in \mathbb{R}^{(dg)\times(dg)}$. We directly give the result here (detailed proof in the supplemental material, available online):

Theorem 2. Suppose $A \in \mathbb{R}^{d\times(dg)}$ and $B \in \mathbb{R}^{(dg)\times(dg)}$. Then $A \cdot B$ can be computed analogously, with mixed copy matrices applied on both sides (the full formula is given in the supplemental material). Both mixed copy matrices $\hat{U}_\sigma(\ell)$ and $\hat{U}_\tau(\ell)$ have $2d$ non-zero diagonal vectors.

We list the high-level procedure in Algorithm 3. We can also conveniently implement the general form of the outer product $\mathbb{R}^{(dg)\times d} \times \mathbb{R}^{d\times(dg)}$ (the same as the aforementioned $\mathbb{R}^{d\times l} \times \mathbb{R}^{l\times d}$ with different symbols). We summarize it in Algorithm 4, where $A \in \mathbb{R}^{(dg)\times d}$, $B \in \mathbb{R}^{d\times(dg)}$, and the output is the diagonal block encoding of the matrix $A \cdot B \in \mathbb{R}^{(dg)\times(dg)}$.

Complexity Analysis and Evaluation
Assume that $s = d^2 g$, where $s$ is the number of packing slots. We sum up the reasons for the inefficiency of [3] as follows: (1) In order to compute $\mathbb{R}^{d\times(dg)} \times \mathbb{R}^{(dg)\times(dg)}$ (and similarly $\mathbb{R}^{(dg)\times(dg)} \times \mathbb{R}^{(dg)\times d}$), they have to stack the left matrix vertically $\sqrt{g}$ times so that $s = (d\sqrt{g})^2$, and thus use $\sqrt{g}$ ciphertexts to hold the stacked matrix, while we use only one ciphertext; (2) In order to compute $\mathbb{R}^{(dg)\times d} \times \mathbb{R}^{d\times(dg)}$, they have to stack the left matrix horizontally and the right matrix vertically $\sqrt{g}$ times, and thus use $2\sqrt{g}$ ciphertexts to hold the stacked matrices, while we use only two ciphertexts. Moreover, as mentioned in Section 3.1.1, Jiang et al. need to use the square matrix multiplication that operates on blocks of dimension $d\sqrt{g} \times d\sqrt{g}$, instead of the rectangular matrix multiplication.
For the clarity of presentation, we analyze the complexity in terms of the number of calls to different secure matrix multiplication algorithms: HE-MatMult (square matrix multiplication), HE-RMatMult (rectangular matrix multiplication [3]), and HE-MatMultParallel (parallel matrix multiplication). Note that these algorithms have different internal complexity in terms of the number of additions, constant multiplications, rotations, and multiplications, according to the previous analysis [3].
We implement the methods of [3] and our algorithms by extending Microsoft's open-source SEAL library for homomorphic encryption [35]. Experiments in this section are performed on a server with an Intel Xeon Platinum core rated at 2.5 GHz. Throughout this work, we always assume that Mult consumes a 30-bit scale and CMult consumes a 15-bit scale, so one matrix multiplication consumes a 60-bit scale (common scale settings to maintain reasonable precision [36]). We summarize our space and time complexity analysis and the experimental evaluation results in Table 2. We test three sets of parameters, from small ($s = 2^{12}$) to large (see Table 2); the large settings were not considered in Jiang et al. [3]. All settings use a 170-bit ciphertext modulus to achieve at least a 128-bit security level according to the recommendation of the ongoing standardization effort for homomorphic encryption [37], which is estimated to protect against known attacks on the LWE problem [38]. The modulus is decomposed into 4 small moduli of $\{50, 30, 30, 60\}$ bits using RNS, where the last 60-bit modulus is a special prime in the library. Our scheme outperforms the previous work in both the asymptotic analysis and the experimental evaluation, and uses fewer ciphertexts. With small parameters, we achieve a $1.5\times$ speedup for $\mathbb{R}^{256\times 256} \times \mathbb{R}^{256\times 16}$, and a $4.5\times$ speedup for $\mathbb{R}^{256\times 16} \times \mathbb{R}^{16\times 256}$. With large parameters ($d = 16$, $g = 256$), the improvement is close to $20\times$. It might seem that using a smaller packing size $s$ could diminish the improvement of our scheme, but in practice, iterative algorithms with matrix multiplication require a large modulus to handle several iterations, and thus require a large $s$ to guarantee 128-bit security.
To give an intuition about the observed improvement, we analyze the difference between HE-RMatMult and HE-MatMultParallel, focusing on the number of the four operations Add, CMult, Rot, and Mult. We summarize the complexity of HE-RMatMult($\mathbb{R}^{d\sqrt{g}\times d\sqrt{g}}, \mathbb{R}^{d\sqrt{g}\times d}$) and HE-MatMultParallel in Table 3. We notice that HE-MatMultParallel has a lower complexity than HE-RMatMult. Although the number of Mult is the same, the difference in the other operations widens the gap between the running times of HE-RMatMult and HE-MatMultParallel. For $\mathbb{R}^{(dg)\times d} \times \mathbb{R}^{d\times(dg)}$, the difference is even larger because they have to use the less efficient square matrix multiplication. In addition, the improvement grows as $g$ becomes bigger, because the complexity of HE-MatMultParallel is independent of $g$. When $g = 1$ or 2, our method is as efficient as Jiang et al. [3], but these corner cases barely satisfy the assumption of a slim rectangular matrix.

SECURE SPARSE SVD OPTIMIZATION
In this section, we propose uSCORE, a novel framework to securely perform collaborative filtering between two parties without leaking any intermediate information. Our method optimizes a latent factor model in the encrypted domain because of its better accuracy compared to neighborhood models in practice, which will be shown in the evaluation (Section 6). As introduced in Section 3.2, the factors $p_u$ and $q_i$ are usually learnt through SVD. However, conventional SVD does not work well in collaborative filtering because the rating matrix is sparse. State-of-the-art methods extract the latent factors with a sparse SVD that solves the following optimization problem through gradient descent [39]:

$\min_{p_*, q_*} \sum_{u,i:\, y_{ui} = 1} \left( r_{ui} - b_{ui} - p_u^T q_i \right)^2 + \lambda \left( \|p_u\|^2 + \|q_i\|^2 \right), \qquad (11)$

where $b_{ui}$ is the baseline estimate. The challenges in securely solving this problem are two-fold: 1) calculating the global baseline estimate across two datasets inevitably requires division operations, which are not supported by HE; 2) the optimization requires iterative large matrix arithmetic. Garbled circuits (GC) can conveniently perform the non-linear (division) operations, but require high online bandwidth for iterative algorithms over large datasets, as well as continuous two-party interaction. In the unbalanced scenario, this incurs too much overhead for the party that has much less data and weaker computation power. The philosophy of our design is to exploit the strengths of both tools, by using GC to solve a sub-problem that is independent of the data sizes and using HE to optimize iteratively.
Since the GC-based computation is almost a direct application of existing libraries (e.g., the EMP toolkit [40]), we focus on securely computing the SVD optimization in this article, and leave a two-party protocol for computing the baseline estimate to the supplemental material, available online. We define the objective in Equation (11) as the cost function $J(p_*, q_*)$, and the prediction error $e_{ui} = r_{ui} - b_{ui} - p_u^T q_i$. The gradients are

$\frac{\partial J}{\partial p_u} = -2 \sum_{i} y_{ui}\, e_{ui}\, q_i + 2\lambda p_u, \qquad \frac{\partial J}{\partial q_i} = -2 \sum_{u} y_{ui}\, e_{ui}\, p_u + 2\lambda q_i. \qquad (12)$

While it is theoretically correct to optimize directly with standard gradient descent, efficient vectorization is needed when the information is encrypted.

Optimization via Diagonal Block MatMult
We show how the optimization can be expressed in the paradigm of our proposal. First, let us express the problem in matrix form. Let $B \in \mathbb{R}^{n\times m}$ contain the $b_{ui}$, $E \in \mathbb{R}^{n\times m}$ contain the $e_{ui}$, $P \in \mathbb{R}^{n\times f}$ contain the $p_u^T$, and $Q \in \mathbb{R}^{f\times m}$ contain the $q_i$. Then, we have

$E = Y \odot (R - B - PQ), \quad \nabla_P J = -2\, E\, Q^T + 2\lambda P, \quad \nabla_Q J = -2\, P^T E + 2\lambda Q. \qquad (13)$

The matrix arithmetic we need is: i) $PQ$: $\mathbb{R}^{n\times f} \times \mathbb{R}^{f\times m}$; ii) $E Q^T$: $\mathbb{R}^{n\times m} \times \mathbb{R}^{m\times f}$; iii) $P^T E$: $\mathbb{R}^{f\times n} \times \mathbb{R}^{n\times m}$; iv) the transpositions $Q^T$ and $P^T$ (there are errors in [3] for transposition; we give the correct version in the supplemental material, available online); v) element-wise addition and multiplication. We observe that in practice, $f$ is usually a small number (e.g., 16). Regarding the encoding, we set $d = f$ and $g = s/f^2$. In order to avoid confusion (and to understand our distributed computing design in Section 5.3), we define the following dimension hierarchy for a matrix: h0, the original dimension with scalar values as units; h1, the dimension with $d \times d$ square blocks as units; h2, the dimension with $(dg) \times (dg)$, $(dg) \times d$, or $d \times (dg)$ blocks as units, depending on the algorithm we use from Section 4. We summarize the dimensions and the necessary ciphertexts for each matrix in Table 4. Without loss of generality, we assume that $dg$ divides both $n$ and $m$; we can pad with zeros to satisfy this constraint. The matrix arithmetic in Equation (13) can then be performed block-wise in the h2 dimension. In addition, we use Nesterov's accelerated gradient descent for the optimization and minimize the multiplication depth to 3 Mult and 4 CMult for one iteration.
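Before mapping onto ciphertext blocks, the update in Equation (13) can be sanity-checked in plaintext. Below is a minimal sketch using plain gradient descent; the encrypted version additionally uses Nesterov momentum and the block encodings, and the names `alpha` and `lam` are our illustrative choices.

```python
# Plaintext sketch of the sparse SVD update in matrix form (Equation (13)).
import numpy as np

n, m, f = 8, 6, 2
rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(n, m)).astype(float)
Y = (rng.random((n, m)) < 0.5).astype(float)    # indicator of observed ratings
B = np.full((n, m), 3.0)                        # baseline estimates b_ui
P = 0.1 * rng.standard_normal((n, f))
Q = 0.1 * rng.standard_normal((f, m))
alpha, lam = 0.005, 0.1                         # learning rate, regularization

for _ in range(100):
    E = Y * (R - B - P @ Q)                     # masked prediction errors e_ui
    grad_P = -2 * E @ Q.T + 2 * lam * P
    grad_Q = -2 * P.T @ E + 2 * lam * Q
    P -= alpha * grad_P
    Q -= alpha * grad_Q

E = Y * (R - B - P @ Q)
print(np.abs(E).sum() / Y.sum())                # training MAE on observed entries
```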

Initialization via Transfer Learning
Without good initialization, the number of iterations needed to achieve good accuracy is too large for reasonable homomorphic encryption parameters. For example, when testing on the MovieLens 100K dataset, we find that it requires at least 100 iterations to converge for various learning rates and momentums, which is far beyond the capability of a somewhat homomorphic encryption scheme (unless we resort to intensive use of the expensive bootstrapping). To maximize the accuracy within the limited number of iterations that a somewhat homomorphic encryption scheme can accommodate, we borrow the idea of transfer learning and initialize the variables $P$ and $Q$ with pretrained values, instead of random ones. In fact, this technique is not completely novel in collaborative filtering; it has been successfully applied to improve the accuracy of learning on sparse data, by transferring knowledge from different domains or datasets [32].
In the preparation phase of our system, the two parties A and B independently train the same problem in Equation (11) on their respective datasets until convergence. Afterwards, A has $(q_i)_{0 \le i < m_A}$, B has $(q_i)_{m_A \le i < m}$, but A and B have different $(p_u)_{0 \le u < n}$. We can directly use the $q_i$ from each party to initialize $Q$ for the combined secure training stage. For $P$, we use a weighted average to merge the pretrained results from both parties: $P = w \cdot P_A + (1 - w) \cdot P_B$, where $P_A$ and $P_B$ are the pretrained factor matrices from A and B, and $w \in [0, 1]$.

Distributed Matrix Multiplication
In the large-scale experiment of Section 6, we use a rating dataset of dimension $16384 \times 16384$, for which the homomorphic computation takes too much time on a single commodity machine. We therefore integrate our large-scale secure matrix computation into a popular distributed computing framework, the Message Passing Interface (MPI).
We describe our distributed secure matrix multiplication below in the h2 dimension of the matrices. To better explain the solution, we take the matrix multiplication $E Q^T$ as an example (Fig. 5). The matrix $E$ has dimension $\frac{n}{dg} \times \frac{m}{dg}$, and
we assign the unit $(i, j)$ ($0 \le i < \frac{n}{dg}$, $0 \le j < \frac{m}{dg}$) to a virtual node with id $\mathrm{vid}_{i,j}$. In addition, node $\mathrm{vid}_{i,j}$ keeps a copy of the $(j, 0)$th unit of $Q^T$. The virtual node could be an independent machine, a computing core, or even a thread. Each node computes its part of the result by using these two units. Afterwards, the final step aggregates the partial results in the same row. This step involves only relatively efficient homomorphic additions, and can be performed by designating one virtual node in each row as the master, which aggregates the partial results sent by the other nodes and then broadcasts the result. Other types of large matrix computation (e.g., $PQ$, element-wise multiplication, constant multiplication) can be handled in a similar way.
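A toy simulation of this layout: each virtual node $(i, j)$ multiplies its block of $E$ with its cached block of $Q^T$, and a per-row master sums the partials. Plain numpy blocks stand in for ciphertexts here; this is our own illustration, not the MPI implementation.

```python
# Toy plaintext simulation of the distributed E @ Q^T layout: node (i, j)
# holds block E_ij and the j-th block of Q^T; row masters add the partials.
import numpy as np

dg, rows, cols, f = 4, 3, 2, 2                  # block size and h2 grid shape
rng = np.random.default_rng(1)
E = rng.standard_normal((rows * dg, cols * dg))
QT = rng.standard_normal((cols * dg, f))

partial = {}
for i in range(rows):
    for j in range(cols):                       # work of virtual node vid_{i,j}
        E_ij = E[i * dg:(i + 1) * dg, j * dg:(j + 1) * dg]
        QT_j = QT[j * dg:(j + 1) * dg, :]
        partial[(i, j)] = E_ij @ QT_j

# Row master vid_{i,0} aggregates the partial results (homomorphic additions only).
result = np.vstack([sum(partial[(i, j)] for j in range(cols)) for i in range(rows)])
assert np.allclose(result, E @ QT)
```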

EVALUATION
We instantiate the homomorphic encryption scheme with CKKS [30] in Microsoft's open-source SEAL library [35]. For the parameters, we choose the polynomial degree $N = 2^{15}$, which provides $s = 2^{14}$ slots. To provide at least 128-bit security, we use an initial ciphertext coefficient modulus $q$ of 855 bits (including a 60-bit special prime). In our setting for the rescaling, 1 Mult consumes 30 bits and 1 CMult consumes 15 bits. Therefore, computing the global baseline estimate (depth of 2 Mult and 3 CMult) consumes 105 bits. The remaining modulus can accommodate 4 iterations, with each iteration consuming 150 bits (depth of 3 Mult and 4 CMult).
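As a quick sanity check of this budget (our own arithmetic from the figures above):

$(2 \cdot 30 + 3 \cdot 15) + 4 \cdot (3 \cdot 30 + 4 \cdot 15) = 105 + 600 = 705 \text{ bits} \;\le\; 855 - 60 = 795 \text{ bits},$

so the global baseline estimate plus 4 iterations fit under the 855-bit modulus with its 60-bit special prime, leaving 90 bits of modulus to hold the final result.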

Accuracy Comparison
To evaluate the accuracy, we use a common metric in the literature, the mean absolute error (MAE), defined as

$\mathrm{MAE} = \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left| r_{ui} - \hat{r}_{ui} \right|,$

where $\mathcal{T}$ is the set of test pairs. The test MAE is calculated on party B's test dataset. We compare the test MAE of uSCORE with several state-of-the-art approaches, including: (i) training solely on party B's own dataset with an item-based neighborhood model (IBNM-1) [41] or with a latent factor model (LF-1) [42]; (ii) training across two parties with an item-based neighborhood model (IBNM-2) [11]; and (iii) training across two parties with a latent factor model as proposed by CryptoRec [10]. CryptoRec is the first efficient two-party latent-factor approach, with the limitation that it re-trains for only one iteration due to the lack of efficient packing and secure matrix multiplication at that time.
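For concreteness, the metric is just the mean of absolute prediction errors over the held-out pairs; a trivial sketch with hypothetical arrays:

```python
# MAE over a test set of (user, item, rating) triples; R_hat holds predictions.
def mae(test_pairs, R_hat):
    errs = [abs(r - R_hat[u][i]) for (u, i, r) in test_pairs]
    return sum(errs) / len(errs)
```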
Although the first re-training iteration leads to a big drop in MAE, we will show that it pays off to train for more iterations in order to achieve optimal accuracy (as also acknowledged in [10, Fig. 5]).
Datasets and Hyper-Parameters. We evaluate the above methods on three widely used datasets in recommendation systems. The MovieLens 100K dataset [31] (ml-100k) contains 100,000 ratings (1-5) from 943 users on 1,682 movies. For the Netflix [43] (movie rating) and Goodreads [44] (book rating) datasets, we select the first 1,024 users who have given more than 20 ratings for the first 2,048 items. The necessary information for these three benchmark datasets is provided in Table 5. We split each dataset into two parts, with party A holding items 1 to 1,024 and party B holding the rest. We also randomly pick 20% of party B's ratings as the test set. For uSCORE, we choose the learning rate $\alpha = 0.005$, momentum $\mu = 0.95$, and regularization parameter $\lambda = 10$ by a grid search. The other models are trained in their suggested settings, searching for better hyper-parameters where necessary.
The results are shown in Table 6. After computing the global baseline estimate and initializing with the pre-trained $P$ and $Q$, uSCORE continues training for 4 iterations. We observe that the two-party methods (IBNM-2, CryptoRec, uSCORE) generally outperform their single-party counterparts, due to the integration of more data from a second party. Moreover, uSCORE performs best on all three datasets among the two-party methods, demonstrating that more than one iteration is indeed necessary to achieve optimal accuracy on various datasets. Nevertheless, we should point out that IBNM-2 [11] does not satisfy our privacy requirement because it resorts to a third party for improving efficiency and reveals intermediate information. Last but not least, uSCORE achieves almost the same accuracy between the secure and plain (unencrypted) versions across all three datasets, showing that the encrypted computation has a negligible effect in this application scenario.

Efficiency Comparison
In this section, we conduct extensive experiments to show the efficiency benefits of uSCORE compared to the pure MPC-based approach SecureML [4], which remains state-of-the-art for two-party privacy-preserving machine learning. We implemented both variants of SecureML at 128-bit security: one extending the EMP toolkit [40] for COT-based multiplication triple generation (OT-SecureML), and one using 3072-bit DGK for LHE-based triple generation (LHE-SecureML). We use ABY's [45] open-source DGK library, and implement the optimizations discussed in ABY and SecureML for reducing COT's communication costs. The linear algebra is built on the Eigen library [46]. We did not compare our efficiency with CryptoRec [10] because their algorithms were not vectorized and would incur hundreds of times more overhead than ours in a direct comparison (both computation and communication); a fair comparison would require a redesign of CryptoRec, which might not be feasible.
Environmental Setup. Our unbalanced scenario is instantiated with a 22-core server and a single-core client, all rated at 2.5 GHz. Our optimized secure matrix multiplication is parallelized with 22 threads, while everything else is executed in a single thread. For SecureML, the computation is parallelized at the server whenever possible (mainly the homomorphic operations in our implementation). The experiments are replicated under two network environments: (i) the LAN setting with a bandwidth of 10 Gbps and 2 ms RTT latency, and (ii) the WAN setting with a bandwidth of 40 Mbps and 40 ms RTT latency. All methods are run for 4 iterations on a $1024 \times 2048$ sub-dataset taken from MovieLens 20M, and the dataset is split into two parts as above.
From Table 7, the client in uSCORE spends only 10 seconds in the LAN setting and 246 seconds in the WAN setting, whereas the server takes care of most of the computation. This is superior to the other listed MPC-based methods in two senses: (i) the client can go offline after some lightweight initial computation, whereas the MPC-based methods require the client to stay online for 24 hours in the WAN setting; (ii) from a practical business point of view, it enables convenient deployment of recommendation as a service. We should emphasize that a 22-core server is a very conservative estimate for an unbalanced cloud-computing scenario; in reality, the server would usually be more powerful, leading to an even more prominent performance advantage for uSCORE. This trend is exemplified further in the large-scale experiments. The important observation is that the server can scale out the computation, but not the bandwidth of the link with the client. Moreover, the SVD optimization with our proposed diagonal block matrix multiplication takes significantly less time than with the method from Jiang et al. [3], which coincides with our benchmarks in Table 2 because matrix multiplication is the bottleneck.

Large-Scale Experiments
We conduct more extensive experiments on a larger dataset of dimension $16384 \times 16384$, sampled from MovieLens 20M. We explore the following case in this section: party A has a dataset of dimension $16384 \times 15360$ and party B has a dataset of dimension $16384 \times 1024$. We randomly split party B's dataset into a training set and a test set in three ways: training (20%)/test (80%), training (50%)/test (50%), and training (80%)/test (20%), leading to different densities of B's training sets. In the preparation phase, we try various hyper-parameters to obtain the best MAE on party B's test set. During the secure computation, all experiments use learning rate $\alpha = 0.002$, momentum $\mu = 0.95$, and regularization parameter $\lambda = 10$. The results are shown in Fig. 6a. Similar to the small-scale experiment, we observe a trend of decreasing MAE with our secure merged training algorithm in all settings.

Fig. 6. Results for the large-scale experiment in our system. (a) shows the MAE on B's test sets when using three training sets of different densities: 0.017%, 0.041%, and 0.066%. We observe similar results under various settings, demonstrating the effectiveness of our solution (e.g., the MAE drops from 0.72 to 0.62 when B's training set density is 0.017%). (b) shows the running time of all 256 machines in each iteration as a box plot. From the plot, we observe that we achieve optimal parallelism, as the costs are almost the same as those in the small-scale experiment on a single machine.
After the phase of computing the global baseline estimate, the above large-scale experiments are conducted by party A on a cluster of 256 nodes, using our distributed secure matrix multiplication (see Section 5.3) in the MPI framework implemented in C++. We view the node distribution as a $16 \times 16$ node grid. Each node handles a sub-dataset of dimension $1024 \times 1024$. When performing the matrix multiplication $E Q^T$ ($\mathbb{R}^{16384\times 16384} \times \mathbb{R}^{16384\times 16}$), nodes $\mathrm{vid}_{i,0}$ ($0 \le i < 16$) are the master nodes for the corresponding rows. When performing the matrix multiplication $P^T E$ ($\mathbb{R}^{16\times 16384} \times \mathbb{R}^{16384\times 16384}$), nodes $\mathrm{vid}_{0,j}$ ($0 \le j < 16$) are the master nodes for the corresponding columns. We show the running time of all 256 nodes in each iteration of the algorithm in Fig. 6b. The communication of intermediate results and the aggregation take little time compared to the secure block matrix multiplication. The client spends less than 3 minutes under this large setting. The iterative SVD optimization starts at a relatively
high running time, but it decreases rapidly in each iteration. This is due to the rescaling procedure, which shrinks the ciphertext. The running times of uSCORE and a linear extrapolation for the MPC methods are provided in Table 7.

Note on Table 7: In the small-scale experiment, the server and the client each hold a $1024 \times 1024$ dataset. In the large-scale experiment, the server holds a $16384 \times 15360$ dataset, and the client holds a $16384 \times 1024$ dataset. All the datasets are sampled from MovieLens 20M; the rows are users, and the columns are movies. '*' means estimated via extrapolation.

Discussion
Bootstrapping. We retrain for a small number of iterations in this evaluation to demonstrate the effectiveness of our framework in improving recommendation accuracy. When experimenting on plain datasets, we observe that the accuracy hardly improves after 4 or 5 iterations. Nevertheless, recent advances in bootstrapping can be applied in order to enable more iterations. Information Leakage. Generally, in privacy-preserving machine learning, solutions based on MPC or HE might suffer from statistical inference attacks on the results. In uSCORE, although only the recommendation results are revealed to a client, they are indeed computed by integrating information from the server. Similar problems also exist in other privacy-preserving recommendation systems [9], [10], [47]. It would be interesting future work to quantify how much information is leaked through the results. To mitigate the problem, an orthogonal approach would be to apply differential privacy in the system [15], [16].

CONCLUSION
We introduced a more efficient scheme for secure matrix multiplication based on homomorphic encryption, especially for rectangular matrix multiplication. The scheme works by packing matrices in the novel compact diagonal block encoding, and by exploiting certain features (e.g., incomplete inter-block rotation) in the linear-transformation decomposition of matrix multiplication.
We integrated our scheme into a large-scale secure recommender system that solves the sparse SVD optimization on encrypted data. The system was tested in a distributed computing framework on encrypted datasets containing hundreds of millions of elements, and showed practical efficiency and effectiveness in improving recommendation accuracy.