Privacy-Preserving Linear Regression on Distributed Data by Homomorphic Encryption and Data Masking

Linear regression is a basic method that models the relationship between an outcome value and some explanatory values using a linear function. Traditionally, this method is conducted on a clear dataset provided by one data owner. However, in today's increasingly digital world, the data for regression analysis are likely distributed among multiple parties and may contain sensitive information about the data owners. In this case, data owners are not willing to share their data unless data privacy is guaranteed. In this paper, we propose a novel protocol for conducting privacy-preserving linear regression (PPLR) on horizontally partitioned data. Our system architecture includes multiple clients and two noncolluding servers. In our protocol, each client submits its data in encrypted form to a server, and the two servers collaboratively determine the regression model on the pooled data without learning its contents. We construct our protocol with Paillier homomorphic encryption and a new data masking technique, which perturbs data by multiplying them by a rational number while they remain encrypted. This data masking technique greatly improves the efficiency of our protocol. We provide an error bound for the protocol and prove it rigorously, and we also provide a security analysis of the protocol. Finally, we implement our system in C++ and Java, and we evaluate our protocol on real datasets provided by UCI. The experiments show that our protocol is one of the most effective approaches to date and has negligible error compared with performing linear regression on clear data.


I. INTRODUCTION
With the extensive use of computer technology, many institutions and individuals have accumulated a large amount of data. This creates an increasing demand for collaborative mining among multiple data owners. Because many data owners lack the professional skills and computing resources for data mining, they have to outsource this work to a service provider. If all data owners agree to submit their clear data to a service provider for mining information of common interest on the pooled data, data mining over distributed data can be easily accomplished. However, in many cases, data owners are unwilling to share their data because the data may contain sensitive information about them.
The associate editor coordinating the review of this manuscript and approving it for publication was Sedat Akleylek .
For example, some medical institutions want to evaluate the effect of different treatment plans for patients with a particular disease. Each treatment plan could be a combination of several drugs in specific proportions. Under the premise that patients' privacy is well protected, all the medical institutions are willing to contribute their data for analysis on the combined dataset. Another example concerns personal credit evaluation. Personal credit is the basis of some individual economic activities, such as a mortgage loan or a car loan. Credit evaluations require data provided by banks, insurance companies, e-commerce firms, judicial offices, etc. All of these institutions have their own data usage policies. It is almost impossible for a third party to collect data from all these institutions in clear text form, which makes comprehensive evaluation of personal credit a difficult problem. These examples belong to distributed privacy-preserving data mining. As we enter the age of big data, such scenarios will become increasingly common.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we focus on privacy-preserving linear regression (PPLR) on a distributed dataset. The approach addresses how several servers (service providers) conduct linear regression when the data are divided among multiple clients (data owners) and no server may learn any information about the clients' data. PPLR belongs to privacy-preserving data mining (PPDM) [1]. Thanks to the efforts of many experts in PPDM, we now have at least four techniques to contend with the PPLR problem. The first method is based on data perturbation [2]-[4]. In this method, each client first masks its data with mathematical transforms, such as rotation, translation and swap. Then, a service provider performs a mining algorithm on the pooled dataset. High efficiency is the prominent advantage of this method. The downside is that too large a perturbation leads to a loss of data utility, whereas too small a perturbation results in a privacy breach. The second method is linear secret sharing [5]-[7]. This method transforms each data element into the sum of several random parts. Each part is sent to a different server. The servers collaboratively perform a mining algorithm using these data parts. Because no server can expose its data part to the other servers, the arithmetic operations become very complex [8]. Data privacy is assured unless the number of colluding servers exceeds a threshold. The third method is the garbled circuit [9]-[11]. In general, this method has two parties, Alice and Bob. Alice has data x and Bob has data y. They want to compute f(x, y), where f is a public function. By using the garbled circuit technique, Alice and Bob can obtain f(x, y) without learning anything about the other party's data. This method cannot be applied to the PPLR problem directly because Alice and Bob play the role of service providers and do not have any input data.
Therefore, we must combine the garbled circuit method with other techniques to ensure the clients' privacy is not compromised. Garbled circuits are also less efficient than other methods when solving the same problem. The fourth method is homomorphic encryption [12], a kind of public key encryption. Homomorphic encryption enables a server to perform computations directly on encrypted data without accessing the secret key, and the results of the computations remain in encrypted form. An attractive idea for using homomorphic encryption on the PPDM problem is one in which clients submit encrypted data to the server and the server conducts data mining on the encrypted data directly. Until now, the practical homomorphic encryption schemes, known as partially homomorphic encryption [13], [14], have supported only one kind of arithmetic operation, addition or multiplication. In contrast, fully homomorphic encryption [15], [16] supports both addition and multiplication. Although a major breakthrough has been made in fully homomorphic encryption, it is still far less efficient than other techniques, so most researchers still choose partially homomorphic encryption in their work. In this paper, we also develop our approach based on partially homomorphic encryption.
Privacy-preserving linear regression or ridge regression over distributed datasets has received considerable attention in recent years, and several approaches have been proposed. Each of them uses one or more of the aforementioned techniques.
In 2005, Karr et al. [22] proposed a protocol for PPLR based on secure summation. Their approach is highly efficient but does not conceal important intermediate values such as the covariance matrix. In 2011, Hall et al. [6] proposed a protocol for PPLR based on secret sharing and homomorphic encryption. Unlike Karr et al.'s approach, their protocol conceals the covariance matrix. However, the iteration they used for computing the inverse matrix is inefficient. In 2017, Mohassel et al. [7] designed a computing framework for PPDM based on two-party secret sharing and provided fixed-point multiplication with O(1) complexity. Their approach is highly efficient in the online computing phase but still requires expensive offline precomputation to prepare for online multiplication.
In 2013, Nikolaenko et al. [10] proposed a protocol to address privacy-preserving ridge regression on distributed records. They used additive homomorphic encryption to generate the covariance matrix and the right-hand-side vector, and then used a garbled circuit to determine the final result. Experiments show their hybrid protocol is more efficient than Hall et al.'s protocol. Gascón et al. [17] extended Nikolaenko et al.'s protocol to vertically partitioned datasets in 2017. They designed a secure inner product protocol for data aggregation and garbled circuits of the conjugate gradient descent algorithm for solving linear equations.
In 2017, Hu et al. [18] proposed a protocol for privacy-preserving ridge regression based entirely on additive homomorphic encryption. They designed a packed secure multiplication protocol, which is used to construct secure Gaussian elimination and Jacobi iteration algorithms for solving linear equations. In 2018, Chen et al. [19] designed a protocol for privacy-preserving ridge regression in which multiplicative and additive homomorphic encryption are used together to implement data aggregation and equation solving. Their experiments show that a protocol based on homomorphic encryption alone is more efficient than a hybrid protocol using garbled circuits.
In 2018, Giacomelli et al. [20] proposed a protocol based on homomorphic encryption and a data masking technique. Using homomorphic encryption, they mask the data in encrypted form, then decrypt them and solve the masked linear equations. Their approach is similar to ours, but our data masking method is different. Our data masking provides a sufficient level of security and greatly reduces the computational cost.

A. OUR CONTRIBUTIONS
Our work in this paper follows the research line of homomorphic encryption. Our system framework is shown in Figure 1. The Evaluator and the Crypto-Service Provider (CSP) are two servers, and the multiple clients are data owners. In our protocol, each client submits its data in encrypted form to the Evaluator, and the Evaluator cooperates with the CSP to determine the data model on the pooled data without learning its contents. Our protocol assumes a semi-honest model in which the two servers do not collude with each other.
Our contributions can be summarized as follows:
• We propose a new protocol for conducting privacy-preserving linear regression on horizontally partitioned data. By combining homomorphic encryption and a data masking technique, two servers determine the data model without learning any information about the input data. We design a new data masking technique that makes our protocol simple and effective. Furthermore, each client can go offline after submitting its data because the client does not participate in any subsequent computations.
• We derive an error bound for the protocol and prove it with rigorous matrix theory. Furthermore, we conduct a security analysis that shows our protocol is secure in a statistical sense.
• We implement our protocol in C++ and Java, and then evaluate its accuracy and efficiency by performing experiments on real datasets provided by UCI. We also compare our protocol with state-of-the-art solutions. The experiments show our protocol is not only one of the most effective approaches to date but also sufficiently accurate, with only negligible error.

II. PROBLEM STATEMENT
We first give some notations used in this paper.
- Bold uppercase: matrix (e.g., X, A, U, V)
- Bold lowercase: vector (e.g., y, b, x_i, b̂_1, b̂_2, α, β, γ)
- Normal lowercase: real number or integer (e.g., a_ij, y_i)
- ⌊f⌋: round a real number f down to the nearest integer
- ⌈f⌉: round a real number f up to the nearest integer
- ⌊f⌉: round f to the nearest integer, i.e., the floor or ceiling of f
- (p, q): bit lengths of the integral and fractional parts of a fixed-point number
- pk, sk: public key and private key
- E(·), D(·): encryption and decryption functions
- E(A): the matrix that consists of elements E(a_ij)
- E(b): the vector that consists of elements E(b_i)
We use x_ij to denote the (i, j) element of X or the j-th element of x_i. Similarly, α_i is an element of α, etc.

A. ARCHITECTURE AND OUR GOAL
Our system architecture consists of three entities, a CSP, an Evaluator and r clients, as shown in Figure 1. Suppose there are m data pairs (x_i, y_i) used in regression, where 1 ≤ i ≤ m, x_i ∈ R^d and y_i ∈ R.
- Clients: Each client holds part of the data pairs in our system. x_i includes d explanatory values, and its corresponding output is y_i. (x_i, y_i) contains some sensitive information of the client.
- CSP: The CSP generates the public/private key pair of the homomorphic encryption scheme and sends the public key to the Evaluator and the clients in the initialization phase.
- Evaluator: The Evaluator sets up the parameters of our protocol in the initialization phase and performs privacy-preserving regression on the combined dataset of all clients with the help of the CSP. Finally, it outputs the result of the regression analysis.
We call d the number of data features in this paper. Our goals are threefold: to build a model without revealing any input data or important intermediate values to the servers, to free clients from complicated data mining tasks, and to design a highly efficient protocol that is suitable for practical use. Here, the important intermediate values are the covariance matrix and its corresponding right-hand-side vector in the linear regression (cf. Section III.A). Some research [6], [23] has shown that revealing the covariance matrix or the corresponding right-hand-side vector may cause privacy leakage.
We assume that all parties are honest-but-curious and that the Evaluator and CSP do not collude with each other. This means all participants follow the instructions of the protocol but try to learn private information by observing the protocol execution.

B. INPUT DATA DISTRIBUTION
In our system, the input data are horizontally partitioned among r clients. This means each data pair (x_i, y_i) cannot be split and belongs to exactly one client. Therefore, we can find integers l_k that satisfy 0 = l_0 < l_1 < · · · < l_r = m, such that client k (1 ≤ k ≤ r) holds the data (x_{l_{k-1}+1}, y_{l_{k-1}+1}), (x_{l_{k-1}+2}, y_{l_{k-1}+2}), · · · , (x_{l_k}, y_{l_k}). We denote client k's data as matrix X_k and vector y_k, whose rows are the x_i^T and y_i held by client k, and all input data as matrix X and vector y, the vertical concatenations of X_1, ..., X_r and y_1, ..., y_r. Figure 2 shows the relations between X and X_k, y and y_k.

A. LINEAR REGRESSION
Linear regression is one of the most widely used approaches for prediction. Its theory is classical and can be found in textbooks [29]. Linear regression takes a large number of data pairs as input and outputs a best-fit linear equation for these data.
Given a set of data pairs (x_i, y_i), 1 ≤ i ≤ m, linear regression seeks coefficients β_1, · · · , β_d and an intercept α such that y_i ≈ β_1 x_i1 + · · · + β_d x_id + α. To simplify the problem, denote x_i = (x_i1, · · · , x_id, 1)^T and β = (β_1, · · · , β_d, α)^T. Then, the goal of linear regression is to find the best-fit function y = β^T x.
The common way to compute β is the least squares method, which requires solving the linear system Aβ = b, where A = X^T X and b = X^T y. Because our input data are horizontally partitioned, client k can compute A_k = X_k^T X_k and b_k = X_k^T y_k locally, and A = Σ_{k=1}^r A_k, b = Σ_{k=1}^r b_k. Later in this article, we always assume that the matrices X and X_k contain d + 1 columns.
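The additive decomposition of A and b over horizontal partitions is what lets each client aggregate locally; a minimal Python sketch (toy data and plain lists, our own helper names rather than any particular matrix library):

```python
def xtx(X):
    """X^T X for a matrix stored as a list of rows."""
    d = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(d)]
            for i in range(d)]

def xty(X, y):
    """X^T y for the same row-list representation."""
    d = len(X[0])
    return [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(d)]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# toy data with the appended constant column (x_i1, 1)
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0]

# split the rows between two "clients"
X1, y1, X2, y2 = X[:2], y[:2], X[2:], y[2:]

A = mat_add(xtx(X1), xtx(X2))                           # A = A_1 + A_2
b = [u + v for u, v in zip(xty(X1, y1), xty(X2, y2))]   # b = b_1 + b_2
assert A == xtx(X) and b == xty(X, y)                   # same as on pooled data
```

In the protocol the summation is performed on ciphertexts by the Evaluator; here it is shown in the clear only to illustrate why horizontal partitioning makes the aggregation phase trivial.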

B. THE PAILLIER CRYPTOSYSTEM
The Paillier cryptosystem [13] is the core encryption scheme used in our protocol. It is a semantically secure [21] and additively homomorphic encryption scheme. Semantic security makes it impossible for any polynomial-time algorithm to gain extra information about a plaintext when given only its ciphertext and the public key. As an asymmetric encryption scheme, the Paillier cryptosystem has a public/private key pair (pk, sk): pk := (n, g) and sk := λ(n) = lcm(p_1 − 1, q_1 − 1), where n = p_1 · q_1, p_1 and q_1 are distinct large primes, g ∈ Z*_{n^2} with n dividing the order of g, and λ(n) is Carmichael's function of n.
Encryption: given plaintext m ∈ Z_n, choose a random r ∈ Z*_n and compute the ciphertext
c = E_pk(m) = g^m · r^n mod n^2,
where E_pk is the encryption function with public key pk.
Decryption: given ciphertext c ∈ Z*_{n^2}, its plaintext m is
m = D_sk(c) = L(c^λ(n) mod n^2) / L(g^λ(n) mod n^2) mod n,
where D_sk is the decryption function with secret key sk and L(u) := (u − 1)/n.
Given a, b ∈ Z_n, the Paillier encryption scheme satisfies the following homomorphic property:
E_pk(a) · E_pk(b) mod n^2 = E_pk(a + b mod n).
As a special case, for a positive integer k,
E_pk(ka) = E_pk(a)^k mod n^2.
In this paper, each client encrypts its data with Paillier encryption before submitting them to a service provider. To simplify the description, we use E(·) and D(·) instead of E_pk(·) and D_sk(·) elsewhere in this paper. In addition, we also omit the ''mod'' suffix when we describe homomorphic addition later in this paper.
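The scheme can be sketched in a few lines of Python; the 20-bit primes and the common choice g = n + 1 below are illustrative assumptions for a toy demo, not parameters from this paper (a real deployment, like the one in section VII, uses a modulus of 1024 bits or more):

```python
import math
import random

def keygen(p1=999983, q1=1000003):
    # toy primes; g = n + 1 is a standard valid choice of generator
    n = p1 * q1
    lam = math.lcm(p1 - 1, q1 - 1)        # sk: Carmichael's lambda(n)
    mu = pow(lam, -1, n)                  # since L(g^lam mod n^2) = lam mod n
    return (n, n + 1), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    r = random.randrange(1, n)            # blinding factor
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    L = lambda u: (u - 1) // n
    return (L(pow(c, lam, n * n)) * mu) % n
```

Multiplying ciphertexts adds plaintexts, D(E(a) · E(b) mod n²) = a + b, and exponentiation by a public k gives D(E(a)^k mod n²) = ka, which is exactly how the Evaluator aggregates and masks data in later sections.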

C. DATA REPRESENTATION
Clients' input data could be real numbers, but Paillier encryption works only on nonnegative integers; therefore, data conversion must be done before encryption. In our protocol, fixed-point numbers are used to convert between real numbers and integers. A fixed-point number has a q-bit fractional part and a p-bit integral part. Because q is fixed, we can drop the binary point and use a binary integer to represent the fixed-point number.
Furthermore, the integers used in Paillier encryption are nonnegative, so the clients need to convert all x_i and y_i into nonnegative numbers locally at the beginning of the protocol. One possible solution is two's complement representation, which represents a negative integer with a positive integer. Our solution is to find a large enough number M that makes x_ij + M ≥ 0 and y_i + M ≥ 0 for all data. In the initial phase of the protocol, each client converts a real number a to a nonnegative integer by the formula ā = ufix(a + M), where ufix(a) is the unsigned fixed-point integer corresponding to a. Some fixed-point arithmetic operations are also used in our protocol. Parameter q is important because it influences the accuracy of the calculations. Parameter p is trivial in most cases because the range of numbers in practice is far less than the upper limit of the encryption scheme.
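As a sketch of this subsection's conversions (the helper names ufix, to_nonneg_int and from_fix are our own, and round-to-nearest is assumed for the encoding):

```python
Q = 40  # fractional bit length q; p is implicit in Python's big integers

def ufix(a, q=Q):
    """Unsigned fixed-point encoding: round a * 2^q to the nearest integer."""
    return int(round(a * (1 << q)))

def to_nonneg_int(a, M, q=Q):
    """Shift by the public offset M so a negative input becomes nonnegative."""
    return ufix(a + M, q)

def from_fix(v, q=Q):
    """Decode a fixed-point integer back to a real number."""
    return v / (1 << q)
```

With q = 40 the encoding error of a single value is at most 2^{-41}, which is why larger q yields more accurate results in the error analysis of section V.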

IV. SECURE LINEAR REGRESSION PROTOCOL

A. THE KEY IDEA OF OUR PROTOCOL
The execution of our protocol includes three phases: initialization, aggregation and regression (see Figure 1). The main tasks of each phase are as follows.
(1) Initialization: The CSP generates the key pair (pk, sk) and sends pk to the other parties; the Evaluator sets (p, q) for fixed-point numbers and sends (p, q) to the other parties. Client k converts its data to nonnegative integers.
(2) Aggregation: Each client computes its local A_k and b_k, encrypts them, and submits them to the Evaluator, which aggregates E(A) and E(b).
(3) Regression: The Evaluator masks the encrypted data with the help of the CSP; the CSP decrypts and solves the masked equations, and the Evaluator unmasks the solutions to obtain β.
Recall A = X^T X and b = X^T y. Let U and V be random invertible diagonal matrices, and write U = S + T. We choose two random numbers w_1, w_2 and let Â := UAV, b̂_1 := w_1 Sb, b̂_2 := w_2 Tb. Then, we construct two linear equations as follows:
Âξ = b̂_1, Âη = b̂_2.
Solving these equations, we can obtain the final result
β = V(ξ/w_1 + η/w_2),
since AV(ξ/w_1 + η/w_2) = U^{-1}(S + T)b = b. The masked Â, b̂_1 and b̂_2 prevent information leakage about A or b. In our protocol, the Evaluator first chooses U, V, w_1 and w_2, then manages to obtain E(Â), E(b̂_1), E(b̂_2), and sends them to the CSP. The CSP decrypts the received data, solves the equations Âξ = b̂_1 and Âη = b̂_2, and sends back the solutions ξ and η. The Evaluator uses ξ and η to determine β.
However, this alone is not secure because the CSP obtains u_i a_ij v_j when solving the masked equations. If a_ij, u_i and v_j are prime numbers, it is not difficult for the CSP to guess a_ij. In our protocol, the elements of U and V are therefore rational numbers with the same denominator; that is, the elements of U and V are u_i/e and v_j/e, where u_i, v_j and e are integers. To obtain an encryption of u_i a_ij v_j/e^2, we first compute E(u_i a_ij v_j) = E(a_ij)^{u_i v_j}. Then, we use an encrypted division technique (protocol 2) to obtain E(⌊u_i a_ij v_j/e^2⌉), where ⌊u_i a_ij v_j/e^2⌉ is the integer nearest u_i a_ij v_j/e^2. We cannot compute E(u_i a_ij v_j/e^2) exactly in most cases because u_i a_ij v_j/e^2 may not be an integer, but replacing it with ⌊u_i a_ij v_j/e^2⌉ causes only negligible error.
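The unmasking identity β = V(ξ/w_1 + η/w_2) behind the key idea can be checked numerically; the sketch below works over exact rationals with random diagonal masks (the 2×2 system and Cramer's-rule solver are our toy choices, not part of the protocol):

```python
from fractions import Fraction as F
import random

def solve2(A, b):
    # Cramer's rule for a 2x2 system over exact rationals
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# clear system A beta = b
A = [[F(5), F(2)], [F(2), F(3)]]
b = [F(9), F(8)]

# random masks: U = S + T and V are diagonal; w1, w2 are scalars
s = [F(random.randint(2, 99)) for _ in range(2)]
t = [F(random.randint(2, 99)) for _ in range(2)]
v = [F(random.randint(2, 99)) for _ in range(2)]
w1, w2 = F(random.randint(2, 99)), F(random.randint(2, 99))
u = [s[i] + t[i] for i in range(2)]

# masked systems: A_hat = U A V, b1 = w1*S*b, b2 = w2*T*b
A_hat = [[u[i] * A[i][j] * v[j] for j in range(2)] for i in range(2)]
b1 = [w1 * s[i] * b[i] for i in range(2)]
b2 = [w2 * t[i] * b[i] for i in range(2)]

xi = solve2(A_hat, b1)
eta = solve2(A_hat, b2)

# unmask: beta = V (xi/w1 + eta/w2)
beta = [v[i] * (xi[i] / w1 + eta[i] / w2) for i in range(2)]
print(beta == solve2(A, b))   # True: masking is exact over the rationals
```

Over the rationals the masking introduces no error at all; the only error in the real protocol comes from the ⌊·⌉ rounding analyzed in section V.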

B. INPUT DATA CONVERSION
In the initialization phase, each client needs to convert its data to nonnegative integers. If all of a client's primary data are nonnegative, the client can simply convert the data to integers using the formula ⌊a · 2^q⌉, where a represents a primary datum. If there are negative input data, our method is to pass a positive real number M to all clients, who use it to convert their local data to nonnegative numbers and then to integers. We propose protocol 1 to complete the data conversion when there are negative data.
In protocol 1, the ciphertexts E(⌈−z_k · 2^q⌉) are collected by the Evaluator before they are sent to the CSP. We do this to keep more information about the clients secret. M is known to all clients and the CSP in protocol 1; this means M is almost public information. If M is sensitive information, each client can add a random positive integer to ⌈−z_k · 2^q⌉ before it is encrypted and passed to the Evaluator.

Protocol 1 Clients' Data Conversion

Parties: Clients, Evaluator, CSP
Input: clients hold x_ij, y_i; the Evaluator holds (p, q); the CSP holds pk
Output: x̄_ij, ȳ_i ∈ N
1: Client k finds the minimum value z_k in the local negative dataset Ng = {x_ij, y_i | x_ij ≤ 0, y_i ≤ 0, l_{k−1} < i ≤ l_k, 1 ≤ j ≤ d}, where k = 1, 2, · · · , r.
2: Client k computes E(⌈−z_k · 2^q⌉) and sends it to the Evaluator.
3: The Evaluator sends all E(⌈−z_k · 2^q⌉) to the CSP.
4: The CSP decrypts all E(⌈−z_k · 2^q⌉), finds the maximum value M in {⌈−z_1 · 2^q⌉/2^q, ⌈−z_2 · 2^q⌉/2^q, · · · , ⌈−z_r · 2^q⌉/2^q} and passes it to all clients.
5: Each client converts its local data to nonnegative integers using the offset M (cf. Section III.C).

C. DATA MASKING TECHNIQUE
In our system, the Evaluator holds data of the form E(x), an integer c and a public integer d. To mask the data x, the Evaluator needs to compute E(⌊xc/d⌉). It is easy to compute E(xc) = E(x)^c. The problem is how to compute E(⌊xc/d⌉) from E(xc) and d. To date, there is no way for the Evaluator to complete this task alone. In our data masking protocol, we use Veugen's approach [26] to perform division of E(xc) by a public divisor d.
In protocol 2, E(⌊r/d⌋)^{−1} is the modular inverse of E(⌊r/d⌋) modulo n^2. In proposition 1, we prove the correctness of protocol 2.
In protocol 2, d is public. If we could design a protocol that masks our data while both c and d are known only to A, the data masking protocol would be perfect. However, this is not easy. Some attempts [24], [25] have been made to compute E(a/b) from E(a) and E(b), but all of these methods are too complicated and time consuming. If our protocol spent too much time on data masking, our advantage in efficiency would be lost.
We include an example here to show why this data masking technique enjoys a high level of security and accuracy. For simplicity, we perform an arithmetic operation on fixed-point decimal numbers in this toy example.
Suppose the input data are x = 2871234561091231, where the low-order 8 digits represent the fractional part. We choose an 8-digit prime number d = 39999931 and a random integer c = 8233875439. We mask x by computing y = ⌊xc/d⌉. Let us consider security first. The CSP can see y after decrypting E(y), and the CSP also knows the divisor d; now the CSP wants to know x. What he can do is compute yd and guess xc based on it. However, yd only approximates xc; in this example, a number of low-order digits of yd and xc differ, so it is difficult to guess xc based on yd. Now, we consider the accuracy when the CSP uses y in computation. The exact computation would use xc/d instead of y. Obviously, |y − xc/d| < 1. Because the low-order 8 digits are the fractional part, the difference between y and xc/d is actually less than 10^{−8}. Clearly, we can obtain sufficient accuracy if the length of the fractional part is large enough.
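The blinded division at the heart of protocol 2 (after Veugen [26]) can be sketched in plaintext arithmetic; the function below models what the two parties jointly compute on z = xc, and the function name and blinding width are our own illustrative choices:

```python
import random

def blinded_div(z, d, blind_bits=80):
    """Model of the joint division: A blinds z additively, B divides by the
    public d, and A removes the blinding afterwards."""
    r = random.getrandbits(blind_bits)    # A's one-time blinding value
    masked = z + r                        # in the protocol, A sends E(z + r) to B
    quotient_b = masked // d              # B decrypts and divides by public d
    return quotient_b - (r // d)          # A computes E(q_b) * E(r // d)^{-1}

x, c, d = 2871234561091231, 8233875439, 39999931
approx = blinded_div(x * c, d)
assert abs(approx - (x * c) // d) <= 1    # result is off by at most one
```

The one-unit slack comes from ⌊(z + r)/d⌋ − ⌊r/d⌋ ∈ {⌊z/d⌋, ⌊z/d⌋ + 1}; as the toy example above shows, an error of one unit in the last place is negligible once a sufficiently long fractional part is used.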

D. OUR PROTOCOL FOR DISTRIBUTED DATA
The protocol described here is the main contribution of this paper. It comprises three phases: initialization, aggregation and regression. Our data masking protocol (protocol 2) is used in the regression phase. Finally, the Evaluator outputs the regression model β in the clear.
In protocol 3, operator ⊗ denotes the elementwise product of two matrices or vectors.

Protocol 3 Secure Linear Regression for Distributed Data
Input:
Client k: X_k, y_k, where k = 1, 2, · · · , r
Evaluator: p, q (bit lengths of the integral and fractional parts); e, a public denominator (prime number)
CSP: pk
Output:
Evaluator: regression model β ∈ R^{d+1}
1. Initialization
a. The Evaluator sets (p, q) and sends it to all clients; then it chooses a q-bit prime number e for data masking and passes it to the CSP.
b. If all data are nonnegative, each client converts its data to integers locally. Otherwise, all clients, the Evaluator and the CSP execute protocol 1 to complete the data conversion. Finally, all x_ij and y_i become nonnegative integers.
c. The CSP generates a key pair (pk, sk) of the Paillier encryption scheme and sends pk to all other parties.
2. Aggregation
a. Client k computes A_k = X_k^T X_k and b_k = X_k^T y_k, encrypts these data locally, then passes E(A_k) and E(b_k) to the Evaluator, where 1 ≤ k ≤ r.
b. The Evaluator computes the matrix E(A) = E(A_1) ⊗ E(A_2) ⊗ · · · ⊗ E(A_r) and the vector E(b) = E(b_1) ⊗ E(b_2) ⊗ · · · ⊗ E(b_r), which by the additive homomorphism encrypt A = Σ_k A_k and b = Σ_k b_k.
3. Regression
a. The Evaluator chooses random integers v_i, s_i, t_i, w_1, w_2 in (e, 2^10 e) and sets u_i = s_i + t_i, where 1 ≤ i ≤ d + 1.
b. The Evaluator (role A) and the CSP (role B) execute protocol 2 to mask the data as follows.
Input 1: A holds E(a_ij) and u_i v_j; B holds pk; public divisor e^2. Output 1: E(⌊u_i v_j a_ij/e^2⌉)
Input 2: A holds E(b_i) and w_1 s_i; B holds pk; public divisor e^2. Output 2: E(⌊w_1 s_i b_i/e^2⌉)
Input 3: A holds E(b_i) and w_2 t_i; B holds pk; public divisor e^2. Output 3: E(⌊w_2 t_i b_i/e^2⌉)
c. The Evaluator sends outputs 1, 2 and 3 to the CSP.
d. The CSP decrypts the received data and obtains Â = (⌊u_i v_j a_ij/e^2⌉), b̂_1 = (⌊w_1 s_i b_i/e^2⌉) and b̂_2 = (⌊w_2 t_i b_i/e^2⌉); then the CSP solves the two equations Âξ = b̂_1 and Âη = b̂_2, and returns ξ, η to the Evaluator.
e. The Evaluator computes the solution β = V(eξ/w_1 + eη/w_2), where V = diag(v_1/e, · · · , v_{d+1}/e).

V. PROTOCOL CORRECTNESS AND ERROR ANALYSIS
In this section, we discuss the correctness of protocol 3 and conduct an error analysis. First, we define the diagonal matrices S, T, U, V, the matrix Â and the vectors b̂_1, b̂_2 as follows:
S = diag(s_1/e, s_2/e, · · · , s_{d+1}/e)
T = diag(t_1/e, t_2/e, · · · , t_{d+1}/e)
U = S + T = diag(u_1/e, u_2/e, · · · , u_{d+1}/e)
V = diag(v_1/e, v_2/e, · · · , v_{d+1}/e) (5.1)
Â = UAV, b̂_1 = (w_1/e)Sb, b̂_2 = (w_2/e)Tb (5.2)
where s_i, t_i, u_i, v_i, w_1, w_2 and e are the random integers in protocol 3. According to definition (5.2), the elements of Â, b̂_1 and b̂_2 are u_i v_j a_ij/e^2, w_1 s_i b_i/e^2 and w_2 t_i b_i/e^2. Obviously, Â, b̂_1 and b̂_2 are the exact versions of the rounded values that the CSP decrypts in protocol 3. We construct two linear systems as follows:
Âξ = b̂_1, Âη = b̂_2.
Then, we use ξ and η to compute β in the same way as in protocol 3. We obtain
V(eξ/w_1 + eη/w_2) = β,
because AV(eξ/w_1 + eη/w_2) = U^{-1}(S + T)b = b. This means β is the true solution when we ignore the rounding error in protocol 3, just as described in the key idea section. Now, the problem is how the small changes to Â, b̂_1 and b̂_2 caused by rounding affect the solution β. From a classical textbook of numerical analysis [27], the following theorem can be found.
Lemma 1: Given two linear systems of equations Ax = b and (A + ΔA)x̃ = b + Δb, where ΔA and Δb are small perturbations, let Δx ≡ x̃ − x; then the following estimate holds:
‖Δx‖/‖x‖ ≤ K(A)/(1 − K(A)·‖ΔA‖/‖A‖) · (‖ΔA‖/‖A‖ + ‖Δb‖/‖b‖),
where K(A) = ‖A‖·‖A^{-1}‖ is the condition number, which measures how sensitive the linear system is to changes in A and b, and ‖·‖ is a matrix or vector norm. Lemma 1 is used in our error analysis of protocol 3. Now, we present our conclusion as follows.

Proposition 2 (Error Estimation): Given the true system Aβ = b and the masked, rounded systems of protocol 3, the relative error of the solution computed in protocol 3 is bounded in terms of K(A) and 2^{−q}.

Proof: Obviously, the value computed in protocol 3 is the approximate solution, and β is the true solution of Aβ = b. First, consider the error of the right-hand sides. Using the infinity norm and the fact that rounding changes each element by at most 1/2, we bound ‖Δb̂_1‖_∞ and ‖Δb̂_2‖_∞. In protocol 3, we have w_1 > e and w_2 > e; hence the relative error of the right-hand sides is small. From the definition of b̂_1, we also know v_j > e. From the definition of Â, V and u_i > 2e, we bound ‖ΔÂ‖_∞. From (5.6), (5.7) and m > d + 1, we obtain (5.8). According to lemma 1, (5.5) and (5.8), we obtain (5.9). Since 2 < u_i/e = (s_i + t_i)/e < 2^11 and U, U^{-1} are diagonal matrices, combining (5.9) we obtain the stated bound.

In most cases, protocol 3 provides a solution with sufficient precision. For example, suppose A is well conditioned with K(A) < 2^50 (≈ 1.1 · 10^15). If we choose q = 40, the bound guarantees a small relative error. In fact, our experiments in section VII show this is a rather conservative estimate. The estimate also shows that the larger the q we choose, the more accurate the solution we obtain.
Remark: There are two preconditions for proposition 2. The first is m ≥ d + 1, which is easy to satisfy because the number of data records used in an analysis is usually greater than the number of data features in practice. The second is Σ_{i=1}^m ȳ_i ≥ 2^q, which indicates that the sum of all y_i in the primary data is no less than 1. This is also easy to satisfy.
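The masking-and-rounding pipeline of protocol 3 and the negligible error asserted by proposition 2 can be checked on a plaintext toy instance (our assumptions: a 2×2 system, an exact rational solver, and an e that is merely a q-bit odd number, since primality does not affect the arithmetic being illustrated):

```python
from fractions import Fraction as F
import random

def solve2(A, b):
    # Cramer's rule for a 2x2 system over exact rationals
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

q = 40
e = (1 << q) + 1
e2 = e * e

A_real, b_real = [[5, 2], [2, 3]], [9, 8]            # true solution beta = (1, 2)
A_int = [[a << q for a in row] for row in A_real]    # fixed-point encodings
b_int = [v << q for v in b_real]                     # scaling both sides keeps beta

rnd = lambda: random.randrange(e + 1, (1 << 10) * e)   # masks in (e, 2^10 e)
s, t, v = [rnd(), rnd()], [rnd(), rnd()], [rnd(), rnd()]
w1, w2 = rnd(), rnd()
u = [s[i] + t[i] for i in range(2)]

half_round = lambda x, d: (x + d // 2) // d          # the round-to-nearest of the paper

A_hat = [[F(half_round(u[i] * v[j] * A_int[i][j], e2)) for j in range(2)]
         for i in range(2)]
b1 = [F(half_round(w1 * s[i] * b_int[i], e2)) for i in range(2)]
b2 = [F(half_round(w2 * t[i] * b_int[i], e2)) for i in range(2)]

xi, eta = solve2(A_hat, b1), solve2(A_hat, b2)
beta = [v[i] * (xi[i] / w1 + eta[i] / w2) for i in range(2)]   # unmask

assert all(abs(float(beta[i]) - [1, 2][i]) < 1e-6 for i in range(2))
```

Only the half_round steps perturb the systems; for this well-conditioned toy matrix the resulting error is many orders of magnitude below the 1e-6 tolerance, in line with the conservative bound above.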

VI. SECURITY ANALYSIS
Our system is secure provided all participants are honest-but-curious. First, since the clients only contribute their own data and cannot see any intermediate results, it is impossible for a client to learn information about the others beyond what is revealed by the final result. Therefore, we only analyze the two servers.

A. SEMI-HONEST EVALUATOR
The Evaluator only has ciphertexts, such as E(a_ij), E(b_i), E(⌊u_i v_j a_ij/e^2⌉), E(⌊w_1 s_i b_i/e^2⌉) and E(⌊w_2 t_i b_i/e^2⌉). Moreover, he also knows the relations between the plaintexts corresponding to these ciphertexts, since he chose the masks himself. These relations are unhelpful except that the Evaluator could use them to deduce one plaintext from knowledge of another. For example, if the Evaluator knew a_ij, he could compute ⌊u_i v_j a_ij/e^2⌉, or vice versa (ignoring the small rounding difference). Therefore, the essential problem is still how to crack the unrelated ciphertexts. However, the Paillier cryptosystem is semantically secure: it is impossible for the Evaluator to gain information about a plaintext given only its ciphertext and the public key. Hence, the security of the Paillier cryptosystem guarantees that the Evaluator cannot learn anything about the input data or the important intermediate values.

B. SEMI-HONEST CSP
In our data masking protocol, the CSP can obtain intermediate results such as a_ij + r_ij and b_i + q_i, where r_ij and q_i are random integers that are used only once. Therefore, the clients' data cannot be disclosed to the CSP in this procedure.
In the regression phase, the CSP obtains the masked matrix Â and right-hand-side vectors b̂_1, b̂_2, whose elements are ⌊u_i v_j a_ij/e^2⌉, ⌊w_1 s_i b_i/e^2⌉ and ⌊w_2 t_i b_i/e^2⌉. Here, we ignore the influence of the ⌊·⌉ operation and only assess the security of the multiplicative perturbation. Suppose the CSP has the values u_i v_j a_ij/e^2, w_1 s_i b_i/e^2 and w_2 t_i b_i/e^2. If the CSP wants to guess the values of a_ij and b_i, he can use the final result β to confirm his conjecture: he must guess the values of all a_ij and b_i and then compute a new model β̃; if β̃ ≈ β, his guess is likely to be right.
Because the values of the random v_i, s_i, t_i, w_1 and w_2 are limited to integers between e and 2^10 e in protocol 3, the probability of the CSP guessing one of them correctly is 1/(1023e). Let us investigate a simple case: m plane points (x_1, y_1), (x_2, y_2), · · · , (x_m, y_m) are given for linear regression. In such a case, the number of data features is d = 1 and the dimension of A is 2; therefore, the CSP has two masked linear systems.
Suppose m is known to all parties. A is a symmetric matrix, which means a_12 = a_21. The CSP can crack the secret as follows.
(1) Suppose the CSP knows v_1 and v_2; he can compute u_2 from the value of u_2 v_2 m (note that a_22 = m).
(2) Then, from the value of u_1 v_2 a_12/(u_2 v_1 a_21), the CSP determines u_1.
(3) Next, if the CSP also knows s_1 and s_2, he can obtain t_1 and t_2 because u_1 = s_1 + t_1 and u_2 = s_2 + t_2.
(4) Finally, if the CSP also knows w_1 and has the final answer β from the Evaluator, he can determine w_2 from β = V(eξ/w_1 + eη/w_2) by using v_1, v_2, w_1 and β.
Therefore, there are 5 independent random variables: v_1, v_2, s_1, s_2 and w_1. The probability that the CSP can guess A and b correctly is 1/(1023e)^5. If e is a 40-bit prime, the odds are approximately 1/2^245. For a regression analysis with d data features, the probability that the CSP can determine A and b is 1/(1023e)^{2d+3}. From the above analysis, it can be ascertained that our protocol is statistically secure due to the multiplicative perturbation. Moreover, the greater e is, the safer the data become.
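A quick back-of-the-envelope check of the 1/2^245 figure (our assumption: e sits near the bottom of the 40-bit range; the exact prime is immaterial to the count):

```python
import math

e = 1 << 39                 # smallest 40-bit value stands in for the prime e
masks = 5                   # v1, v2, s1, s2, w1 are the independent unknowns
# each mask is one of 1023*e - 1 integers in the open interval (e, 2^10 e)
work_bits = masks * math.log2(1023 * e - 1)
print(round(work_bits))     # 245, matching the ~1/2^245 probability above
```

For d features the same count gives (2d + 3) · log2(1023e) bits of guessing work, i.e., the 1/(1023e)^{2d+3} probability stated above.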

VII. EXPERIMENT AND COMPARISON
We assess our privacy-preserving linear regression protocol through a set of numerical experiments on several real datasets. All experiments are performed on a Dell Mobile Workstation M6800 with an 8-core CPU at 2.70GHz and 32 GB RAM, running Ubuntu Linux 16.04. We implement our protocol in C++ with the NTL library and in Java on the OpenJDK 1.8 platform. The Paillier encryption scheme with a 1024-bit modulus is adopted in our implementation.

A. DATASETS
We choose 6 datasets from the UCI repository [28] for our experiments. These datasets are listed below.

• Auto MPG
This dataset contains 398 records about cars and is used to predict the miles per gallon of each car. We removed the car name attribute and 6 records with missing horsepower values. The final dataset has 1 target and 7 predictive attributes, and contains 392 records.
• Wine Quality This dataset contains 4898 records of wine used for predicting wine quality. We choose the white-wine dataset, which has 11 predictive attributes and 1 target attribute.
• Bike Sharing This dataset contains 17379 records about bike rentals. We remove the record index, the date, the count of casual users, and the count of registered users from the file hour.csv, leaving 12 attributes to predict the total count of rental bikes.

• Forest Fires
This dataset contains 517 records used to predict the burned area of a forest. We remove the month attribute and recode the weekday attribute from 'mon' to 1, 'tue' to 2, and so on. Finally, we have 11 predictive attributes and 1 target attribute.

• Communities and Crime
This dataset contains information on 1994 communities. We remove 5 nonpredictive attributes and use 122 attributes to predict the per capita number of crimes in each community. All missing values are replaced with 0.
• YearPredictionMSD This dataset describes 515344 songs with 90 audio features each and is used to predict the release year of each song. To speed up the experiment, only the top 100000 records are used.
For every dataset, we append the integer 1 to the end of each record so that the constant term of the regression equation can be determined. We modify the datasets only to obtain usable data for our experiments, not for serious prediction research. Furthermore, the data are not normalized.
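The constant-column step above can be sketched as follows; the record layout and method name are our own illustration:

```java
import java.util.Arrays;

public class Preprocess {
    // Append the constant-1 column that absorbs the intercept term
    // of the regression equation.
    static double[] appendIntercept(double[] record) {
        double[] out = Arrays.copyOf(record, record.length + 1);
        out[out.length - 1] = 1.0;
        return out;
    }

    public static void main(String[] args) {
        double[] row = {7.0, 130.0, 3504.0};             // predictive attributes
        System.out.println(Arrays.toString(appendIntercept(row)));
        // prints [7.0, 130.0, 3504.0, 1.0]
    }
}
```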

B. EVALUATION OF ACCURACY
When we compute a quantity such as â = u_i v_j a_ij / e^2 in Protocol 3, we actually obtain the truncated integer ã = ⌊u_i v_j a_ij / e^2⌋ instead. This truncation causes a very small calculation error. Because â and ã are integers corresponding to fixed-point numbers, |â − ã| < 1 means the real gap between the two values is less than 2^−q. This indicates that the accuracy of the data masking procedure depends on q. Furthermore, because q is the length of the fractional part of the fixed-point numbers, it also strongly influences the truncation error of our system. Consequently, the accuracy of the protocol depends on q: the larger q is, the more accurate the calculation.
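A minimal sketch of this fixed-point representation, assuming the standard scaling by 2^q (the constant and method names are ours):

```java
import java.math.BigInteger;

public class FixedPoint {
    static final int Q = 30; // length of the fractional part in bits

    // Encode a real number as the integer floor(x * 2^q).
    static BigInteger encode(double x) {
        return BigInteger.valueOf((long) Math.floor(x * (1L << Q)));
    }

    // Decode back to a real number; the truncation error is below 2^-q.
    static double decode(BigInteger v) {
        return v.doubleValue() / (1L << Q);
    }

    public static void main(String[] args) {
        double x = 3.141592653589793;
        double err = Math.abs(x - decode(encode(x)));
        System.out.println(err < Math.pow(2, -Q)); // prints true
    }
}
```

A difference of 1 between two such encoded integers thus corresponds to a real gap of 2^−q, matching the error discussion above.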
Let β be the model learned on the clear data and β̃ the model learned in our system. We define the relative error of our system as Err_β = ‖β − β̃‖ / ‖β‖. Table 1 shows that the accuracy of β̃ improves as q increases. Moreover, the results indicate that there is no obvious correlation between m and Err_β, or between d and Err_β. Although the condition number K(A) is used in evaluating the error bound in section V, there is no direct correlation between K(A) and the accuracy of the output.
It must be pointed out that changing q has almost no impact on the running time of the protocol. In our experiments listed in table 1, the running time shows no identifiable difference between q = 10 and q = 50 for any dataset. In fact, after a client encrypts a data item, the corresponding ciphertext is always an integer of fixed bit length determined by the Paillier modulus, no matter what the value of q is. Moreover, when the two servers execute our protocol, they always compute on encrypted data, except that the CSP solves two data-masked linear systems. Because the time spent on a normal multiplication or division is less than 1% of that spent on an encryption (see table 5), our protocol spends almost all of its running time on crypto operations.

C. EVALUATION OF RUNNING TIME
In this section, we focus on the time spent on online computation. Because the computation in the initial phase of our protocol is mainly offline, we only evaluate the time efficiency of the aggregation and regression phases. The results of another experiment are shown in figure 4. In this experiment, both m and d are treated as variables. We set d to 10, 30, 50, 70, and 90, and for each value of d, we measure the running time of the protocol (including the aggregation and regression phases) as m varies from 20000 to 90000. The results reveal that the running time of the protocol increases with m or d, but d has the more obvious influence. For example, with d fixed at 90, the running time rises by only 30% as m runs from 20000 to 90000; however, with m set to 60000, the running time rises by 700% as d increases from 30 to 90. The reason is that, apart from the O(md^2) arithmetic operations each client performs in step (a) of the aggregation phase, no other computation depends on m. Thus, our protocol is relatively more sensitive to the variation of d.
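The m-dependence is confined to the clients' local aggregation. A plain (unencrypted) sketch of that O(md^2) step, with array shapes of our own choosing, is:

```java
public class ClientAggregation {
    // Locally aggregate m records with d features into the d x d matrix
    // A = X^T X and the d-vector b = X^T y; this is the only step whose
    // cost grows with m.
    static double[][] gram(double[][] X) {
        int d = X[0].length;
        double[][] A = new double[d][d];
        for (double[] row : X)
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++)
                    A[i][j] += row[i] * row[j];
        return A;
    }

    static double[] rhs(double[][] X, double[] y) {
        int d = X[0].length;
        double[] b = new double[d];
        for (int k = 0; k < X.length; k++)
            for (int i = 0; i < d; i++)
                b[i] += X[k][i] * y[k];
        return b;
    }

    public static void main(String[] args) {
        double[][] X = {{1, 1}, {2, 1}, {3, 1}}; // last column is the constant 1
        double[] y = {2, 4, 6};
        System.out.println(gram(X)[0][0]); // prints 14.0  (1 + 4 + 9)
        System.out.println(rhs(X, y)[0]);  // prints 28.0  (2 + 8 + 18)
    }
}
```

Everything after this aggregation operates on the d x d system alone, which is why the overall running time is dominated by d.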

D. COMPARISON WITH OTHER PROTOCOLS
In recent years, several protocols have been proposed to tackle privacy-preserving linear or ridge regression on distributed data. Most of these solutions fall into two categories: protocols that combine garbled circuits with homomorphic encryption, and protocols based on homomorphic encryption alone. The protocol of [19] uses Paillier's and ElGamal's encryption schemes to generate and solve linear equations in encrypted form. Table 3 shows the comparison of experimental results. Due to the adoption of the data masking technique, our protocol is more efficient than theirs under similar experimental conditions.
VOLUME 8, 2020

3) COMPARISON WITH GIACOMELLI et al.'s PROTOCOL
Finally, we compare our protocol with Giacomelli et al.'s protocol [20], which is also based on homomorphic encryption alone. We implement Giacomelli et al.'s protocol in C++; therefore, the results in this section all come from our own experiments. The two protocols employ different data masking techniques. In the aggregation phase, they adopt the same method for horizontally partitioned data, so to compare them we only need to focus on the regression phase. Table 4 shows the computational cost of the two protocols' regression phases; the cost counts all operations of the Evaluator and the CSP. Enc, Dec, and Exp in table 4 denote encryption, decryption, and the modular-exponentiation operation in Paillier's cryptosystem. The arithmetic operations include not only the basic arithmetic operations but also the modular multiplication (Mul) in Paillier's cryptosystem. Table 5 shows the average ratio of the time spent on the other operations to that spent on encryption. Obviously, the cost of the arithmetic operations is much smaller than that of the crypto operations, which means the running time of the protocols is mainly determined by Enc, Dec, and Exp. Among these, Exp is special because its cost rises with the exponent (see table 4): when the bit length of the exponent increases from 60 to 500, the value of Exp/Enc rises from approximately 6% to 50%. Figure 5 shows that Giacomelli et al.'s protocol is more efficient than ours when d is less than 70, but when d is greater than 70 our protocol is better. The top 5000 records of the YearPredictionMSD dataset are used in this experiment, and some columns are added to the dataset to fit the experiment. The bit length of the exponent in the Exp operation is set to 60. Obviously, if a larger exponent were used, the performance of our protocol would overtake theirs earlier.
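As a reference point for the Enc, Dec, and Mul operations discussed above, here is a minimal textbook Paillier sketch (with a small modulus chosen for illustration only, not our experimental parameters):

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierDemo {
    final BigInteger n, n2, g, lambda, mu;
    final SecureRandom rnd = new SecureRandom();

    PaillierDemo(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, rnd);
        BigInteger q = BigInteger.probablePrime(bits / 2, rnd);
        n = p.multiply(q);
        n2 = n.multiply(n);
        g = n.add(BigInteger.ONE); // standard choice g = n + 1
        BigInteger p1 = p.subtract(BigInteger.ONE), q1 = q.subtract(BigInteger.ONE);
        lambda = p1.multiply(q1).divide(p1.gcd(q1)); // lcm(p-1, q-1)
        mu = L(g.modPow(lambda, n2)).modInverse(n);
    }

    BigInteger L(BigInteger x) { return x.subtract(BigInteger.ONE).divide(n); }

    // Enc(m) = g^m * r^n mod n^2 with random r
    BigInteger enc(BigInteger m) {
        BigInteger r = new BigInteger(n.bitLength() - 1, rnd).add(BigInteger.ONE);
        return g.modPow(m, n2).multiply(r.modPow(n, n2)).mod(n2);
    }

    // Dec(c) = L(c^lambda mod n^2) * mu mod n
    BigInteger dec(BigInteger c) {
        return L(c.modPow(lambda, n2)).multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierDemo ph = new PaillierDemo(256);
        BigInteger a = BigInteger.valueOf(41), b = BigInteger.valueOf(1);
        // Homomorphic addition: Dec(Enc(a) * Enc(b) mod n^2) = a + b,
        // which is the Mul operation counted in table 4.
        System.out.println(ph.dec(ph.enc(a).multiply(ph.enc(b)).mod(ph.n2)));
        // prints 42
    }
}
```

The modular exponentiations in enc and dec dominate the running time, which is why the Enc, Dec, and Exp counts determine the overall cost of both protocols.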
When comparing the communication cost, only the regression phase needs to be considered because both protocols adopt the same approach for aggregation. In the regression phase, the communication cost of Giacomelli et al.'s protocol is d^2 + 2d, while that of our protocol is 3d^2 + 4d. The reason is that the linear equations need to be transmitted only once in Giacomelli et al.'s protocol, but three times in ours.
Although efficiency differences exist between Giacomelli et al.'s protocol and ours, the gap is far smaller than that between protocols of different types. We believe both protocols are fit for solving this problem in practice.

VIII. CONCLUSION
In this paper, we propose a protocol that learns a linear regression model over distributed clients' data without leaking any information about the clients to the service provider. Theoretical analysis and numerical experiments have been performed to verify its efficiency, accuracy, and security. By taking advantage of the data masking technique, our protocol is more efficient than most existing protocols. By combining the merits of homomorphic encryption and data masking, our protocol realizes a high level of security and accuracy. These advantages make our protocol well suited for practical applications, especially for realizing a regression module in a privacy-preserving machine learning task.
As future work, we believe the idea of introducing rational-number data masking into encrypted data is a fundamental technique, and we are interested in extending it to other privacy-preserving machine learning methods.