Asymptotically Optimal and Secure Multiwriter/Multireader Similarity Search

Privacy-preserving similarity search is a method of data retrieval from potentially untrusted hosts based on the similarity between encrypted data items. In this setting, a major concern is how to support searches when multiple users (multireader) search for similar items over data encrypted by multiple data owners (multiwriter). Unfortunately, previous similarity search schemes address this by forcing users to communicate with data owners, which incurs a significant communication overhead. Moreover, these schemes use deterministic algorithms to encrypt data, which not only violates the privacy of data but also complicates the proof of semantic security. In this paper, we propose an efficient and secure multiwriter/multireader similarity search scheme over encrypted data in cloud storage. In the proposed scheme, the cloud server is able to perform searches without incurring any interaction between users and data owners. Thus, we achieve asymptotically optimal communication cost. We provide rigorous proofs of data privacy in the standard model. Then, we show that the proposed scheme achieves semantic security based on the data privacy. An in-depth experiment on an INRIA image dataset demonstrates the practicality of the proposed scheme.

encrypted data items [23], [24], [25]. In these schemes, a data owner (or writer) and a user (a reader) must share a common secret key to encrypt their data contents and queries, respectively. However, sharing secrets is impractical in modern data outsourcing systems because the data owner and the user are unlikely to be in the same trust domain.

To tackle this issue, Cui et al. showed how to enable similarity searches even when data owners (multiwriter) and users (multireader) encrypt their data contents and queries under different keys they each generate (this is called a multiwriter/multireader (M/M) setting) [27]. In this scheme, however, users must communicate with all of the data owners to re-encrypt their queries under the data owners' keys during searches. Because both queries and data owners can be numerous in M/M settings, this limitation critically restricts the scalability of similarity searches. Hahn et al. proposed a solution minimizing such an overhead [28]. Their scheme adopts a trusted key server which generates secret keys on behalf of data owners and users. These secret keys are correlated with some public parameters which are used by the cloud server to perform searches without re-encryption. However, securely retrieving the encrypted similar data and decrypting them are not presented in the scheme. Moreover, a formal security proof is not presented in the study.

The associate editor coordinating the review of this manuscript and approving it for publication was Sedat Akleylek.

Another problem of previous similarity search schemes is the violation of query privacy. Assume a user sends a series of encrypted queries, but no match is found in a database.
In this scenario, an adversary can infer the plain query data from the ciphertext by exploiting side information such as a

encrypted database, a user (reader) encrypts trapdoors which are used as search queries, and an untrusted server runs an equality test between the indexes and trapdoors without decrypting them. Since most searchable encryption schemes focus only on single keyword searches, they do not provide flexible search capabilities.

To address the issue, fuzzy searchable encryption was introduced [22], which is similar to conventional searchable encryption in that it also provides a keyword search. The main difference is that, as "fuzzy" indicates, it supports flexibility when searching for keywords, i.e., minor typos or format inconsistencies are accepted. However, fuzzy searchable encryption is still inefficient because, in real-world datasets, even the same data can have diverse formats, encodings, or edits [30], [31].

To enhance the accuracy of similarity search, most schemes use identifiable information extracted from data such as images to measure the similarity between different data items [23], [24]. Specifically, as a building block, these schemes rely on an approximate near neighbor search algorithm called locality sensitive hashing (LSH) [3]. Since an LSH function, given two similar but different data items as input, returns the same hash values with high probability, these values can be used to test the similarity between data. To preserve privacy, the LSH output values are encrypted and published as a searchable index (and trapdoor) in the similarity search schemes. If the encryption algorithm is deterministic, one can test whether two data items are similar by checking whether their searchable index and trapdoor are the same.
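As a rough illustration of how LSH output values can serve as comparable subfeatures, the following sketch uses random-hyperplane (sign) hashing. This is only one LSH family and is not claimed to be the one used by the cited schemes. The function names, the parameter `k` (bits per subfeature), and the rule "similar if at least `th` subfeatures match position-wise" are assumptions made for illustration.

```python
import random

def make_hyperplanes(dim, l, k, seed=0):
    """Draw l groups of k random hyperplanes (hypothetical parameters)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
            for _ in range(l)]

def subfeatures(vec, planes):
    """Map a feature vector to l subfeatures, each a k-bit integer."""
    feats = []
    for group in planes:
        bits = 0
        for h in group:
            dot = sum(x * w for x, w in zip(vec, h))
            # One bit per hyperplane: which side of the plane the vector lies on.
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        feats.append(bits)
    return feats

def matches(fa, fb):
    """Count position-wise equal subfeatures (a plain equality test)."""
    return sum(1 for a, b in zip(fa, fb) if a == b)

def is_similar(fa, fb, th):
    """Assumed rule: two items are similar if at least th subfeatures match."""
    return matches(fa, fb) >= th

if __name__ == "__main__":
    planes = make_hyperplanes(dim=8, l=16, k=4, seed=1)
    v = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, -2.0, 1.0]
    same_direction = [2 * x for x in v]   # same direction, different scale
    opposite = [-x for x in v]            # opposite direction
    f = subfeatures(v, planes)
    print(matches(f, subfeatures(same_direction, planes)))
    print(matches(f, subfeatures(opposite, planes)))
```

In a privacy-preserving scheme, the subfeatures would be encrypted before publication, so the equality test has to be run over ciphertexts rather than over the plain values shown here.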
In similarity searches, a practical challenge is how to efficiently support multiple readers and writers while achieving high performance and scalability. Because SSE and PEKS are not suitable for multireader settings, recent work on S/M settings has allowed both the data owner to encrypt indexes and multiple users to search [18], [19], [20]. In [18], each user encrypts his search queries using his secret key, and those encrypted queries need to be re-encrypted under the data owner's key. This encryption is deterministic, so anyone can check whether queries are repeated or not. In [19] and [20], the data owner publishes his public key, and each user makes trapdoors by encrypting their queries using that public key and randomly chosen values. Given the searchable index and trapdoor, an untrusted server can perform an equality test without any form of re-encryption. However, query encryption is dependent on the data owner's public key. Thus, extending their scheme to M/M settings requires each query to be repeatedly encrypted under the public keys of all the data owners.

Unfortunately, this efficiency problem in M/M settings has not yet been solved for similarity searches [23], [24], [26], [27]. Similarity search schemes are limited to either S/S [23], [24], [25] or S/M settings [26]. Although Cui

[28]. However, this scheme only provides a way to locate similar data stored in the cloud server and lacks a data

data, an adversary cannot perform searches with previously sent queries even if the queries and the newly inserted data contain the same plaintext. Two follow-up studies improve the security by providing backward secrecy [39] or the efficiency by supporting parallel searches [40]. Unfortunately, all the proposed techniques are limited to S/S settings and single keyword searches.
Moreover, forward secrecy only guarantees privacy of previously sent queries, not future queries. Therefore, constructing secure similarity searches in the presence of file-injection attackers, especially for M/M settings, is still an open and challenging problem, which we leave as future work.

In this section, we briefly introduce locality sensitive hashing and its application to similarity searches, bilinear maps, and security assumptions. We also give algorithmic definitions of similarity searches in M/M settings.

Locality sensitive hashing (LSH) is an approximation algorithm that extracts identifiable features from data to measure the similarity between different items [3]. The basic idea of LSH is to reduce the dimensionality of high-dimensional data using a set of hash functions that map similar items to the same values with high probability. Given a data item, an LSH function extracts a subfeature set $\{f_1, \ldots, f_l\}$. Then, it conducts equality testing to measure the similarity between the feature sets of two items, and this process determines the search accuracy of LSH-based similarity searches.

One can realize similarity search schemes [23], [24], [26] using conjunctive keyword searches [19], [20] by replacing keywords with LSH values and vice versa [27]. Thus, the proposed scheme can also be applied to conjunctive keyword searches. However, none of the existing keyword and similarity search schemes are asymptotically optimal regarding M/M settings, a problem which we attempt to solve in the context of similarity search.

Algorithm B, which outputs coin ∈ {0, 1}, has advan-

We define the similarity search scheme for M/M settings. The notations we use are described in Table 1. The scheme consists of the following six algorithms:

The setup algorithm takes as input the security parameter λ. It outputs the public parameter PK and master key MK.

When NMF occurs, deterministic encryption can render the query privacy unattainable since an adversary can infer plain query data [5]. Thus, the best way toward NMF security is to make the query encryption semantically secure. We say that the proposed scheme is NMF-secure if the adversary observes a number of encrypted queries that will result in NMF but cannot distinguish pairs of ciphertexts.
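To make the concern about deterministic query encryption concrete, the sketch below contrasts a deterministic tag with a randomized one. It is a toy model built on HMAC-SHA256 from the Python standard library, not the pairing-based construction of the proposed scheme, and the function names are hypothetical. Unlike the proposed scheme, the randomized variant here needs the key to test equality; it is shown only to illustrate why an eavesdropper can link repeated deterministic queries but not randomized ones.

```python
import hashlib
import hmac
import os

def det_tag(key: bytes, feature: bytes) -> bytes:
    """Deterministic tag: the same feature always yields the same bytes."""
    return hmac.new(key, feature, hashlib.sha256).digest()

def rand_tag(key: bytes, feature: bytes):
    """Randomized tag: a fresh nonce makes every tag look different."""
    nonce = os.urandom(16)
    mac = hmac.new(key, nonce + feature, hashlib.sha256).digest()
    return nonce, mac

def rand_tag_matches(key: bytes, feature: bytes, tag) -> bool:
    """Check a randomized tag against a candidate feature (requires the key)."""
    nonce, mac = tag
    expected = hmac.new(key, nonce + feature, hashlib.sha256).digest()
    return hmac.compare_digest(mac, expected)

if __name__ == "__main__":
    key = os.urandom(32)
    q = b"subfeature-17"
    # An eavesdropper sees identical bytes for repeated deterministic queries.
    print(det_tag(key, q) == det_tag(key, q))
    # Randomized tags for the same query are unlinkable by simple comparison.
    print(rand_tag(key, q) == rand_tag(key, q))
```

Linkability of repeated queries is exactly what lets an observer combine ciphertexts with side information, which is why the text above asks for semantically secure query encryption.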

It is important to note that guaranteeing NMF security in the presence of such an adversary is the same as guaranteeing semantic security in the presence of eavesdroppers. Therefore, we define NMF security for data privacy using the following game between adversary A and challenger C.
where λ is a security parameter and the probability is taken over the randomness used by A and the randomness used in the data privacy game.

By slightly modifying the data privacy game, we can define the query, index, and file privacy games, respectively. That is, NMF security is not only used for query privacy, but it can also be used for index and file privacy. We give their formal privacy games as follows.
We define the index and file privacy games below similarly.

VOLUME 10, 2022

Note that deterministic encryption of queries directly reveals the search pattern in the presence of an eavesdropping adversary, something we want to avoid in our construction. Next, the access pattern refers to the information revealed from the search result, formally written as follows.

The access pattern can be used to guess about the trapdoors. For example, one trapdoor may return three files, while another may return many, say ten, files. This indicates that the predicate (e.g., a threshold) in the first trapdoor is more restrictive than that in the second.

In a similarity search, a query consists of the subfeatures that are used to construct the trapdoors. Thus, we need to capture the search pattern for subfeatures. The similarity pattern is an extension of Definition 2, formally stated as follows.

We now formally define a history, trace, and view. In brief, a history is a collection of data and queries generated by data owners and users, respectively. A trace implies all the information that a data owner leaks publicly. A view is the suite of the encrypted history which is accessible to an adversary. Their definitions are given as follows.

Given the definitions of trace $\gamma(H_n)$ and view $v(H_n)$, the security goal we aim to achieve is straightforward: we will build a simulator S who can build a simulated view $v_S(H_n)$ from $\gamma(H_n)$. If an adversary A who has access to a real view $v_R(H_n)$ cannot distinguish $v_R(H_n)$ from $v_S(H_n)$, a similarity search scheme is said to achieve adaptive semantic security. Formally, we have the following definition.

For all $i \le l$, it computes $K_{1,i} = g^{\alpha\beta} g^{a z t_i}$ and $K_{2,i} = g^{t_i}$. It publishes the secret key as $SK = \{K_{1,i}, K_{2,i}\}_{i \in \{1, \ldots, l\}}$. Next, the algorithm chooses a random $t$ and computes $D_1 = g^{\alpha\beta} g^{at}$ and $D_2 = g^{t}$. It publishes the decryption key as $DK = \{D_1, D_2\}$. While the secret key is used only once to generate a trapdoor, the decryption key is the long-term key used to decrypt a ciphertext. Thus, if the decryption key has already been issued, the algorithm then skips the decryption key generation process.

The first effective solution to the efficiency problem is to make secret keys reusable. To this end, we use an additive masking technique [34] such that after receiving SK from the key server for the first time, a user additively masks SK using a randomly chosen value, say χ, to generate a trapdoor.
In this way, the user need not interact with the key server for subsequent queries. More precisely, we use every algorithm in the proposed scheme as it is, except for the Trapdoor algorithm, which is modified as follows:

Note that since the modified Trapdoor algorithm is run solely by users, per-query interactions between users and the key server are avoided.

In terms of security analysis, the randomizing component χ is chosen at random for every query to (1) mask SK, since without χ, one cannot recover SK; and (2) randomize the trapdoor so that an adversary cannot tell whether queries are repeated by trivially observing them. However, whether it is possible to rigorously prove the security of this approach in the standard model with respect to the data privacy game is unknown (see §IV-B). This is because it is unclear how to simulate the randomizing component χ in the context of standard-model adversaries. Thus, we leave the formal security proof of this extension, which is outside the focus of this study, for future work.

In this section, we prove that the proposed scheme guarantees query, index, and file privacy. Then, we prove that the proposed scheme is adaptively semantically secure. It is important to note that the adaptive semantic security of the proposed scheme is reduced to the security of data privacy, which is formally proved in the standard model.

Setup: B parses $y = (g, g^{\alpha}, g^{\beta})$, chooses a random $b' \in \mathbb{Z}_p$, and implicitly sets $b = \beta + b'$ by computing $g^{b} = g^{\beta} \cdot g^{b'}$. B computes $e(g^{\alpha}, g^{b})$, picks random $a, z \in \mathbb{Z}_p$, and computes $g^{a}, g^{az}$. B sets $PK = (g, e(g, g)^{\alpha b}, g^{a}, g^{az})$, and gives PK to A. A outputs a pair of subfeatures $f_0$ and $f_1$.

Guess: A eventually outputs a guess $\mathrm{coin}'$ for coin. Following this, B outputs 0 to guess that $T = g^{\alpha\beta}$ if $\mathrm{coin}' = \mathrm{coin}$. Otherwise, it outputs 1 to indicate that T is a random element in $\mathbb{G}_T$.

When the input tuple is sampled from (y, T), where $T = g^{\alpha\beta}$, then A's view is identical to its view in $\mathrm{Game}^{\mathrm{query}}_{A}$, and therefore we have $\Pr[B(y, T = g^{\alpha\beta}) = 0] = \frac{1}{2} + \mathrm{Adv}_A$. When the input tuple is sampled from (y, T), where T is a random group element, then the subfeature is completely hidden from the adversary, and therefore we have $\Pr[B(y, T = R) = 0] = \frac{1}{2}$. Thus, with g, a, and T uniform in $\mathbb{G}$, $\mathbb{Z}_p$, and $\mathbb{G}_T$, respectively, B solves the decisional DH problem with a non-negligible advantage. This completes the proof of the theorem.
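For reference, the final step of the argument above can be written out explicitly. The only assumption is the usual definition of a distinguisher's advantage as the absolute difference of the two probabilities; the symbols are those used in the proof.

```latex
\[
\Bigl|\Pr\bigl[B(y, T = g^{\alpha\beta}) = 0\bigr]
      - \Pr\bigl[B(y, T = R) = 0\bigr]\Bigr|
  = \Bigl|\bigl(\tfrac{1}{2} + \mathrm{Adv}_A\bigr) - \tfrac{1}{2}\Bigr|
  = \mathrm{Adv}_A .
\]
```

Hence B's distinguishing advantage equals A's advantage in the query privacy game, so it is non-negligible whenever A's is.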

Theorem 2: Suppose the decisional BDH assumption holds. Then, the proposed scheme has index privacy.

Proof: Suppose adversary A has the non-negligible advantage $\mathrm{Adv}_A$ in $\mathrm{Game}^{\mathrm{index}}_{A}$ against the proposed scheme. Using A, we show how to build a simulator B that solves the decisional BDH problem in $\mathbb{G}$.

2 Note that, given $(g, g^{\alpha}, g^{\beta})$ and T, the simulator can trivially distinguish whether T is $g^{\alpha\beta}$ or a random value by determining whether the two values $e(g^{\alpha}, g^{\beta})$ and $e(T, g)$ are the same or not.

We have shown that each of the three simulations was taken

[32].

The proposed scheme and [27] share certain algorithms, such as Setup, BuildIndex, Trapdoor, and Search. These algorithms are evaluated in detail by changing different parameters including the number of subfeatures (l), the threshold (th), and the number of data owners (N). To correctly examine how each parameter influences the algorithms, we fix two parameters while modifying the other.

Note that the number of subfeatures l affects search accuracy and efficiency. Intuitively, a smaller l implies less accurate but faster searches, while a larger l implies more accurate

, th, N). We fix th = 1 and N = 1 and change l in Fig. 3(a), fix l = 16 and N = 1 and change th in Fig. 3(b), and fix l = 10 and th = 1 and change N in Fig. 3(c). While the setup time of the proposed scheme remains the same regardless of (l, th, N), the setup time of Cui et al.'s scheme increases linearly with N in Fig. 3(c). This is because in Cui et al.'s scheme each user interacts with all data owners to compute the re-encrypting components. Thus, the Setup algorithm of the proposed scheme does not depend on the 3-tuple (l, th, N), while that of Cui et al.'s scheme varies with N.

It was observed that the BuildIndex and Trapdoor algorithms had linear relationships with l under both schemes (see Fig. 4(a) and Fig. 5(a), respectively). However, while the time required for BuildIndex increases linearly as l increases, it remains constant with respect to (th, N), as shown in Fig. 4(b) and Fig. 4(c). In a similar way, the time required for Trapdoor, as seen in Fig. 5(b) and Fig. 5(c), remains steady with respect to (th, N). In contrast, Fig. 6 illustrates that the time required for Search depends on th. We observe that our scheme takes less search time than Cui et al.'s scheme does, despite ours requiring more pairing operations. This is because we use a different type of curve to implement the proposed scheme. Specifically, ours uses a Type-A curve for symmetric pairing while Cui et al.'s scheme uses a Type-D curve for asymmetric pairing. In the PBC library [29], Type-A is the fastest and Type-D is slower but effective when elements are short. In fact, it is widely known that symmetric pairings can be implemented much more efficiently than standard asymmetric pairings [41]. Thus, the Search algorithm takes less time in the proposed scheme.

We present the total running time of our scheme and Cui et al.'s scheme in Fig. 7. As shown in Fig. 7(a) and Fig. 7(b), the running times of both our scheme and Cui et al.'s scheme grow linearly with respect to (l, th), and our scheme is faster than Cui et al.'s scheme in both cases. The total running time of Cui et al.'s scheme becomes much longer when we fix (l, th) and modify N (Fig. 7(c)), while that of ours remains approximately constant because none of its algorithms depend on N. To conclude, the proposed scheme is faster than Cui et al.'s scheme for all tested values of the 3-tuple (l, th, N).