Scalable and Popularity-Based Secure Deduplication Schemes With Fully Random Tags

It is non-trivial to provide semantic security for user data while achieving deduplication in cloud storage. Some studies deploy a trusted party to store deterministic tags for recording data popularity and then provide different levels of security for data according to their popularity. However, deterministic tags are vulnerable to offline brute-force attacks. In this article, we first propose a popularity-based secure deduplication scheme with fully random tags, which avoids the storage of deterministic tags. Our scheme uses homomorphic encryption (HE) to generate comparable random tags to record data popularity and then uses binary search in an AVL tree to accelerate tag comparisons. Besides, we identify the popularity tamper attack in existing schemes and design a proof of ownership (PoW) protocol against it. To achieve scalability and updatability, we introduce multi-key homomorphic proxy re-encryption (MKH-PRE) to design a multi-tenant scheme. Users in different tenants generate tags using different key pairs, and cross-tenant tags can still be compared for equality. Meanwhile, our multi-tenant scheme supports efficient key updates. We give a comprehensive security analysis and conduct performance evaluations based on both synthetic and real-world datasets. The results show that our schemes achieve efficient data encryption and key update, and have high storage efficiency.

deduplication to reclaim a lot of storage space. Due to the insecure network environment, users tend to outsource encrypted data to prevent data privacy from being snooped on. However, conventional encryption algorithms aim to provide semantic security for user data. In other words, the encrypted data are indistinguishable from random bits. This property hinders data deduplication since the same messages will be encrypted into indistinguishable ciphertexts. Convergent encryption (CE) [1] is the first attempt to achieve encrypted deduplication. The encryption keys in CE are derived from the data content, so CE is a deterministic encryption algorithm and ensures that identical messages are encrypted into identical ciphertexts. Nevertheless, CE provides confidentiality only for unpredictable data (the message space cannot be exhausted) [2]. For predictable data (the message space can be exhausted), CE is vulnerable to offline brute-force attacks [2].
Since semantic security and data deduplication seem to be irreconcilable, some studies attempt to explore a tradeoff between them. For example, some schemes [3], [4] consider data popularity in encrypted deduplication systems. It is a reasonable way to balance data security and storage efficiency by setting different security levels for user data based on their popularity. Some popular songs or movies may be shared by a large number of users, which can be considered popular data. The medical records or scientific research results are generally shared by only a small number of users, which can be considered unpopular data. Obviously, unpopular data require more protection than popular data. In the popularity-based encrypted deduplication scheme, only when the data become popular will they be encrypted with CE and be deduplicated. The unpopular data are encrypted randomly to be provided with semantic security. Stanek et al. [3] use two real-world datasets to analyze the storage efficiency of the popularity-based deduplication scheme. The storage performance is good for datasets containing many files with very high popularity.
Existing popularity-based encrypted deduplication schemes [3], [4] need to deploy a trusted third party to store deterministic tags for recording data popularity. However, if the trusted party is compromised by adversaries, the deterministic tags will be vulnerable to offline brute-force attacks, which is a serious security vulnerability. Besides, we identify the "popularity tamper attack" in existing popularity-based schemes [3], [4]. Since unpopular data are usually encrypted randomly in popularity-based schemes, it is difficult for the server to verify the data ownership of users. In other words, it is difficult to design proof-of-ownership (PoW) protocols [5], [6] for unpopular data in popularity-based schemes. However, in an encrypted deduplication scheme without a PoW protocol, the data popularity can be tampered with by adversaries who hold only a small part of the data. Under such attacks, the cloud server may mistakenly increase the number of owners. We call this the "popularity tamper attack", which will be described in detail in Section II-C.

A. Contribution
In this article, we first present a popularity-based encrypted deduplication scheme with fully random tags. We use homomorphic encryption (HE) to generate random data tags and achieve a ciphertext-level comparison, which eliminates the requirement of storing deterministic tags to record data popularity. Furthermore, we use an AVL tree to store random tags based on their homomorphism and alleviate the inefficiency of the random tag comparison by binary search. Compared with other fully random tag comparison schemes [7], [8], our method does not need the involvement of clients and supports dynamic insertion and deletion of tags. Besides, we find the popularity tamper attack in the popularity-based deduplication and design a PoW protocol based on HE to resist it. Only the users with the whole data content could increase the count of data owners.
Since multi-tenancy and scalability are crucial in cloud storage, we further extend our scheme with multi-key homomorphic proxy re-encryption (MKH-PRE) to provide these properties. In our multi-tenant scheme, users can be divided into different tenants. Cross-tenant users generate their random tags with distinct HE key pairs, while the cloud server can record data popularity by comparing cross-tenant tags for equality, thanks to the property of MKH-PRE. Note that a tenant is a customer of the cloud storage service. For example, a tenant can be an enterprise or a government department. As a side benefit of using MKH-PRE, our multi-tenant scheme supports efficient key updates for random tags based on proxy re-encryption. Our contributions are summarized as follows:
• First, we propose a popularity-based encrypted deduplication scheme, which is the first attempt to record data popularity based on fully random tags instead of deterministic tags. Then, we use binary search to reduce the time complexity of tag comparison to logarithmic time.
• Second, we point out the popularity tamper attack in existing popularity-based encrypted deduplication schemes [3], [4] and design a PoW protocol based on HE to resist it.
• Third, we introduce MKH-PRE into our scheme to provide multi-tenancy and scalability. Besides, efficient key update can be achieved in our multi-tenant scheme.
• Finally, we give the security analysis and conduct performance evaluations. The evaluation results show that the introduction of HE and MKH-PRE does not bring much extra computation overhead. Compared with the scheme proposed by Stanek et al. [3], our schemes have a slight improvement in encryption efficiency.
This is the full version of the paper that has been accepted by TrustCom 2021 [9]. Compared with the conference version, the main differences are as follows: First, we expand our original scheme to a multi-tenant scheme for scalability and updatability by introducing MKH-PRE. Second, we add integrity verification to the original scheme to provide data integrity. Third, we give a more elaborate security model and security analysis for our schemes. Finally, we give a more comprehensive performance evaluation: we add the performance comparison with [7], add the evaluation of the communication overhead and storage overhead, and evaluate the performance of the key update.

A. Encrypted Deduplication
In conventional encryption algorithms, messages are encrypted into random bit strings and are hard to deduplicate. Convergent encryption (CE) [1] is the first solution to achieve encrypted deduplication, which derives the encryption key from the data content. It utilizes the property of deterministic encryption to realize encrypted deduplication. Bellare et al. [10] formalize CE as message-locked encryption (MLE) and analyze its security. They state that MLE is vulnerable to offline brute-force attacks and can only provide confidentiality for unpredictable messages. DupLESS [2] deploys a key server to introduce a system-level secret for MLE to offer confidentiality for both predictable and unpredictable messages. However, the key server may become a single point of failure and an efficiency bottleneck. Some schemes [11], [12], [13] focus on designing encrypted deduplication schemes without additional independent servers. But these schemes all require a certain percentage of clients to be online, which makes it difficult to apply them in real scenarios. In sum, how to provide semantic security for outsourced data while achieving data deduplication is still an important issue.

B. Client-Side Deduplication
Deduplication can be divided into server-side and client-side deduplication. The former needs clients to upload all data to the server, then the server performs the data deduplication. The latter needs clients first to upload data tags to the server. If the server finds that the data have already been stored, then it marks the client as a data owner. Therefore, clients only need to upload the data that are not stored in the server. Compared with server-side deduplication, client-side deduplication can save both bandwidth and storage overheads.
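The client-side flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction of any cited scheme: the class DedupServer and the function upload are hypothetical names, the tag is a plain SHA-256 hash of the data, and no proof of ownership is performed.

```python
import hashlib


class DedupServer:
    """Toy storage server for client-side deduplication (illustrative only)."""

    def __init__(self):
        self.blobs = {}    # tag -> stored data
        self.owners = {}   # tag -> set of client ids

    def has(self, tag):
        return tag in self.blobs

    def add_owner(self, tag, client_id):
        self.owners[tag].add(client_id)

    def store(self, tag, data, client_id):
        self.blobs[tag] = data
        self.owners[tag] = {client_id}


def upload(server, client_id, data):
    """Upload `data` and return the number of payload bytes actually sent."""
    # Step 1: the client sends only a short tag (here: a deterministic hash).
    tag = hashlib.sha256(data).hexdigest()
    # Step 2: if the data are already stored, the client is merely marked as an owner.
    if server.has(tag):
        server.add_owner(tag, client_id)
        return 0
    # Step 3: otherwise the client uploads the data themselves.
    server.store(tag, data, client_id)
    return len(data)
```

Note that in this sketch the tag alone is enough to be registered as an owner, which is the weakness that PoW protocols are designed to close.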
However, client-side deduplication suffers from ownership cheating attacks [5]. Concretely, small data tags calculated from the data are used to identify data ownership. An adversary with only the data tags can convince the server that it has the whole data and can then download the data. Halevi et al. [5] find this vulnerability and present a proof-of-ownership (PoW) protocol based on the Merkle tree and specific encodings to let the client efficiently prove to the server that it does have the whole data content. The limitation is that the server in their scheme is trusted. Xu et al. [6] propose a PoW scheme against the honest-but-curious server. Their scheme is proved secure with regard to any distribution with sufficient min-entropy. Most existing PoW schemes [5], [6], [14], [15], [16], [17] focus on data encrypted by deterministic encryption (such as MLE) rather than random encryption. How to design PoW for random ciphertexts is still an unsolved problem.
Besides, the duplicate faking attack [10] is also a threat to client-side deduplication. If a malicious uploader sends a correct data tag and a fake ciphertext to the server, the server will only store this fake ciphertext and deduplicate the subsequently uploaded data. Then a subsequent data uploader may download and restore the fake data, and the data integrity is compromised.

C. Popularity-Based Encrypted Deduplication
Recently, several encrypted deduplication schemes [3], [4] try to provide a fine-grained security-to-efficiency trade-off based on data popularity. They set different security levels for user data based on how popular they are (see Fig. 1). If the number of data owners exceeds a popularity threshold, then the data can be considered popular and just need to be encrypted with CE. Otherwise, they should be considered unpopular and be encrypted with semantically-secure encryption algorithms. Stanek et al. [3] use a threshold cryptosystem to design a popularity-based encrypted deduplication scheme. When the data become popular, the storage server (i.e., the cloud server) decrypts the random ciphertexts into convergent ciphertexts using enough decryption shares. PerfectDedup [4] leverages the perfect hash function to enable clients to securely check data popularity without leaking data information to the storage server. However, existing schemes [3], [4] have the following limitations.
• The storage of deterministic tags: Existing schemes [3], [4] need to deploy a trusted third party to maintain the correspondence between deterministic tags and random tags for recording data popularity, as shown in Fig. 2. The purpose of this is to let the honest-but-curious storage server only store random tags instead of deterministic tags (the hashes of data) to prevent data information leakage. But, once the deterministic tags are leaked to an adversary, the semantic security for unpopular data will be compromised. With the increasing amount of outsourced data, the number of deterministic tags that the trusted third party needs to store also increases rapidly. The storage of a large number of deterministic tags greatly increases the risk of data information leakage.
• Popularity tamper attack: Existing schemes [3], [4] are vulnerable to the popularity tamper attack since they do not perform PoW for unpopular data. Specifically, during the uploads for unpopular data, existing schemes cannot verify whether a client indeed has complete data content due to the lack of the PoW for random ciphertexts. So, a malicious client without complete data content can convince the storage server of its data ownership by uploading only a data tag. Assuming that an adversary has the data tag of a target file F_t and compromises multiple clients, it can have them upload data tags to the storage server to tamper with data popularity. Once F_t becomes popular data, its security protection will be degraded. For predictable data, the adversary can first launch the popularity tamper attack to make them become popular and degrade their security protection, then it can launch offline brute-force attacks to restore data information. In a word, the popularity tamper attack is a serious security concern for popularity-based deduplication schemes.

D. Random Tag
Data tags in encrypted deduplication systems are used to detect data duplication. If two pieces of data are identical, then their tags should be identical. In existing encrypted deduplication systems, the hashes of user data are generally used as data tags. However, the hashes are deterministic and can easily leak data information. Abadi et al. [8] propose R-MLE2, a fully randomized encrypted deduplication scheme to provide lock-dependent security. They use fully random data tags to detect data duplication. However, the time complexity of their tag comparison algorithm is linear, which is inefficient. μR-MLE2 [7] uses a decision tree to store random tags and reduces the time complexity of tag equality-testing to logarithmic time by having clients constantly interact with the server. But the insertion or deletion of an intermediate node in the decision tree requires the assistance of clients. Therefore, their scheme requires the assumption that many clients are always online, which makes their scheme less practical. So, the equality-testing of random tags is still a challenge.

A. Convergent Encryption
Convergent encryption (CE) uses the hash of message content as the encryption key to enable deduplication.It can be defined with the following algorithms.
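As an illustration, CE can be sketched in a few lines. The function names mirror the calls CE.KeyGen and CE.Enc used later in the construction, but the code itself is an assumption made for this example: SHA-256 serves as the hash function, and a hash-based keystream stands in for a real symmetric cipher such as AES. It is not secure and is only meant to show the deterministic behaviour.

```python
import hashlib


def ce_keygen(message: bytes) -> bytes:
    """CE.KeyGen: derive the key from the message content itself."""
    return hashlib.sha256(message).digest()


def _keystream(key: bytes, length: int) -> bytes:
    # Toy keystream (SHA-256 in counter mode); a stand-in for a real cipher.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def ce_enc(key: bytes, message: bytes) -> bytes:
    """CE.Enc: deterministic encryption under the content-derived key."""
    stream = _keystream(key, len(message))
    return bytes(a ^ b for a, b in zip(message, stream))


def ce_dec(key: bytes, ciphertext: bytes) -> bytes:
    """CE.Dec: XOR with the same keystream restores the message."""
    return ce_enc(key, ciphertext)


def ce_tag(ciphertext: bytes) -> str:
    """Deterministic tag computed as the hash of the convergent ciphertext."""
    return hashlib.sha256(ciphertext).hexdigest()
```

Because the key depends only on the content, two users holding the same file produce the same ciphertext and the same tag, which is what enables deduplication and also what enables offline brute-force attacks on predictable data.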

B. AVL Tree
The AVL tree [18] is a self-balancing binary search tree with a balance condition: for every node in the tree, the absolute value of the difference between the heights of its left and right subtrees (the balance factor) is at most 1. When inserting or deleting a node, the AVL tree rebalances the affected subtrees through rotations according to the order relation between nodes, so that the balance factor always stays less than or equal to 1. Hence, when the AVL tree contains n nodes, the height h of the tree is O(log(n)). A search, insert, or delete operation traverses at most the height of the tree in the worst case, so its time complexity is O(log(n)).
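The balance condition can be made concrete with a short sketch of AVL insertion. This is a textbook-style illustration with hypothetical names (Node, insert), using ordinary comparable keys; deletion is omitted for brevity.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # height of the subtree rooted at this node


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def balance(node):
    # Balance factor: left height minus right height.
    return height(node.left) - height(node.right)


def rotate_right(y):
    x = y.left
    y.left = x.right
    x.right = y
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    update(x)
    update(y)
    return y


def insert(node, key):
    """Insert `key` and return the (possibly new) root of the subtree."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node  # key already present
    update(node)
    b = balance(node)
    if b > 1:  # left-heavy
        if balance(node.left) < 0:  # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if b < -1:  # right-heavy
        if balance(node.right) > 0:  # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node
```

Even when keys arrive in sorted order, which would degrade a plain binary search tree into a list, the rotations keep the height logarithmic in the number of nodes.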

C. Homomorphic Encryption
Homomorphic encryption (HE) [19], [20], [21] is a probabilistic encryption algorithm that allows users to perform arithmetic operations on encrypted data without decryption. The HE scheme consists of the following algorithms:

D. MKH-PRE
Multi-key homomorphic encryption (MKHE) [22], [23], [24], [25] is a generalization of HE in the multi-user setting, which supports the homomorphic computation on ciphertexts encrypted with different key pairs. Yasuda et al. [26] propose multi-key homomorphic proxy re-encryption (MKH-PRE), which adds the function of proxy re-encryption on the basis of MKHE. In MKH-PRE, each key pair has a corresponding id. Each ciphertext is accompanied by an id set T, which is used to record all key pairs involved in the encryption process. The set T_0 of a freshly generated ciphertext only consists of one element, while the set T_m of the ciphertext after multi-key homomorphic computation may become T_m = {id_0, id_1, ..., id_n}. An MKH-PRE scheme consists of the following algorithms:
• MP.Add(ct_1, ct_2, {pk_i}_{i∈T_A}): On input ciphertexts (ct_1, ct_2) of (m_1, m_2) and the corresponding public keys {pk_i}_{i∈T_A}, output the ciphertext ct_add of (m_1 + m_2). The id set of ct_add is T_A, which is equal to T_1 ∪ T_2. Note that T_1 and T_2 respectively denote the id sets of ct_1 and ct_2.
• MP.Sub(ct_1, ct_2, {pk_i}_{i∈T_A}): On input ciphertexts (ct_1, ct_2) of (m_1, m_2) and the corresponding public keys {pk_i}_{i∈T_A}, output the ciphertext ct_sub of (m_1 − m_2). The id set of ct_sub is T_A, which is equal to T_1 ∪ T_2. Note that T_1 and T_2 respectively denote the id sets of ct_1 and ct_2.
• MP.Dec({sk_i}_{i∈T}, ct): On input a ciphertext ct and the corresponding secret keys {sk_i} in its id set T, output the message m.
The decryption of the multi-key ciphertext needs to input all secret keys in the id set. However, it is unreasonable to allow one party to own all secret keys. Therefore, MKH-PRE provides distributed decryption, which consists of PartDec and FinDec. Specifically, one party divides the ciphertext ct into several ciphertext shares ct_i according to its id set, and then sends them to each party related to the decryption. Each related party invokes PartDec to decrypt ct_i with its secret key sk_i, and outputs a partial decryption share ρ_i. Finally, a party collects all ρ_i to invoke FinDec to restore the message m. The definitions of PartDec and FinDec are described as follows:
• PartDec(ct_i, sk_i): On input a ciphertext share ct_i and a secret key sk_i, output a partial decryption share ρ_i.
• FinDec({ρ_i}): On input all partial decryption shares {ρ_i}, output the message m.
Besides, MKH-PRE also supports proxy re-encryption. The algorithms of the re-encryption key generation and re-encryption are described as follows:
• MP.RKGen(sk_A, sk_B): On input the secret keys sk_A and sk_B, output a re-encryption key rk_{A→B}.
• MP.ReEnc(rk_{A→B}, ct_A): On input a re-encryption key rk_{A→B} and a ciphertext ct_A, output a re-encrypted ciphertext ct_B. Note that ct_A and ct_B are respectively protected by (pk_A, sk_A) and (pk_B, sk_B).
Remarks: The homomorphic encryption algorithms (HE and MKH-PRE) in our schemes are implemented based on the Brakerski-Fan-Vercauteren (BFV) [27] and NTRU [28] schemes. The ciphertext polynomials in BFV and NTRU are both defined over a ring. The plaintexts need to be encoded to plaintext polynomials in binary. We set the plaintext modulus as 3, so a homomorphic ciphertext can be decrypted into a plaintext polynomial with coefficients in {−1, 0, 1}. After we decrypt the output ciphertext of HE.Sub(ct_1, ct_2) and then obtain (m_1 − m_2), we could get the algebraic relation between m_1 and m_2 by extracting the highest order coefficient Co of (m

IV. SECURE POPULARITY-BASED DEDUPLICATION SCHEME

A. Main Idea
There are several challenges that need to be addressed in popularity-based encrypted deduplication. The first challenge is how to record data popularity without compromising the semantic security of unpopular data. Since data tags are used to perform the duplication check, it is necessary to perform the equality-testing on them to record the count of data owners to reflect the data popularity. It is trivial to perform equality-testing on deterministic tags. But they are vulnerable to offline brute-force attacks. Therefore, we want to design a scheme with fully random tags. However, it is non-trivial to directly perform equality-testing on random tags. We use the homomorphic ciphertexts of deterministic tags as random tags to resolve this issue. HE supports equivalence comparison on ciphertexts without decryption. So, we can perform equality-testing on random tags. Moreover, HE can also be used to resist offline brute-force attacks, in that it is probabilistic rather than deterministic. As shown in Fig. 3, our scheme deploys a crypto-service provider (CSP) to manage the key pair of HE and perform the homomorphic decryption. When the storage server (SS) performs the tag equality-testing, it sends the homomorphic subtractions between random tags to CSP. The latter decrypts them and returns the comparison results to SS.
The second challenge is how to improve the efficiency of the equality-testing of random tags. The linear time complexity of the tag comparison is unacceptable when there are a large number of random tags. Our solution is to use binary search. Due to the property of HE, SS can get the algebraic relationship between any two random tags by interacting with CSP. Therefore, random tags could be sorted and then stored in an AVL tree. The time complexity of tag comparison can be reduced to logarithmic time by the binary search in the AVL tree. Another advantage of our scheme is that the equality-testing only requires the interaction between SS and CSP, while clients do not need to be involved. As a consequence, our scheme circumvents the limitation in μR-MLE2 [7].
The third challenge is how to resist the popularity tamper attack. As described in Section II-C, in existing popularity-based deduplication schemes [3], [4], an adversary with only data tags can tamper with data popularity due to the lack of PoW. In the worst case, the adversary can launch the popularity tamper attack to make the target data become popular and then launch offline brute-force attacks to restore data information. Most existing PoW schemes [5], [14], [15], [16], [17] aim at convergent ciphertexts. They cannot be applied to resist the popularity tamper attack since the unpopular data are encrypted randomly. Besides, most existing PoW schemes leak data hashes, which will compromise the semantic security of unpopular data. To this end, we design a PoW protocol based on HE to resist the popularity tamper attack without leaking any data information. The idea of our PoW is to let clients compute the hashes of some randomly sampled challenge blocks and then encrypt these hashes with HE. These encrypted hashes are used as the proofs of data ownership. SS can verify whether the proofs are valid through the interaction with CSP. Since the proofs are randomly encrypted, they do not reveal any data information.

B. Architecture
As shown in Fig. 3, our scheme consists of three entities: clients, a storage server (SS), and a crypto-service provider (CSP).
• The client outsources user data to SS to save local storage overhead. To maintain privacy, the client outsources encrypted data to SS.
• The storage server (SS) provides data storage services for multiple users and performs cross-user deduplication to save storage space.
• The crypto-service provider (CSP) is independent of clients and the storage server. It is responsible for managing the HE key pair. It authenticates clients and distributes the HE public key to SS and all authenticated clients. Note that CSP can be implemented by the third-party external cryptographic service [29], where a cryptographic server performs some cryptographic operations [30] for applications. Many cryptography-based schemes [2], [30], [31], [32], [33] and well-resourced enterprises (e.g., Facebook [34]) deploy external cryptographic services.

C. Threat Model
We assume that CSP is a semi-trusted third party, which is similar to the assumption for the key server in DupLESS [2]. In our scheme, CSP cannot learn any data information, but it needs to store the HE secret key sk securely. If an adversary compromises CSP and learns sk, then the security for all user data degrades to convergent security. Besides, we consider the following two kinds of adversaries.
An honest-but-curious (HBC) storage server SS honestly follows our proposed protocol, but attempts to compromise data confidentiality. It can access all outsourced data and attempts to restore data information.
A malicious client C holds a random tag and a portion of the data content of a target file F_t. It aims to launch the popularity tamper attack to convince SS of its data ownership and tamper with data popularity. Besides, it tries to launch the duplicate faking attack for F_t to compromise the data integrity of other users.

D. Security Goal
The security goals of our scheme are as follows.
Data Confidentiality. Our scheme provides semantic security for unpopular data and convergent security for popular data.
Data Integrity. Our scheme provides integrity for outsourced data. No adversary can tamper with user data stored in SS without being detected.
Tag Consistency. Our scheme provides tag consistency, including validity and security.
1) Validity: Only if two pieces of data are identical can SS interact with CSP to identify that their tags are identical.
2) Security: Tags should be indistinguishable from random bit strings of equal length.
Attack Resistance. Our scheme resists the typical attacks in encrypted deduplication, such as brute-force attacks [2], popularity tamper attacks [9], ownership cheating attacks [5], and duplicate faking attacks [10]. No adversary can launch these attacks to compromise data confidentiality or integrity.

E. Definition
Let λ and M be a security parameter and the message space, respectively. We denote by F({A_i({x_i})}) → {a_i} the event that multiple parties {A_i} jointly engage in the protocol F with the inputs {x_i} and outputs {a_i}. The empty string is denoted by ε.
Our scheme consists of the algorithms and protocols (KG, RTG, PopDet, PoW, Enc, Dec, CV, IV), which we define as follows:
KG(λ) → (pk, sk): CSP uses a security parameter λ to invoke the key generation algorithm to generate a key pair (pk, sk). After that, we assume that all parties take λ as input in all algorithms and protocols.
RTG(pk, m) → rT: The random tag generation algorithm takes the public key pk and a message m ∈ M as input, and outputs a random tag rT.
PopDet(SS(rT, T_A), CSP(sk)) → (ctr, ε): During the popularity detection protocol, SS inputs a random tag rT and an AVL tree T_A, while CSP inputs the secret key sk. When the protocol concludes, SS outputs the data popularity ctr, while CSP outputs nothing, denoted by the empty string ε.

F. Construction
Here, we present the constructions of the algorithms and protocols in our scheme.
Key Generation (KG): Our scheme follows the key generation algorithm in HE. During the system setup, CSP runs HE.KeyGen(λ) to generate the HE key pair (pk, sk).

Random Tag Generation (RTG): To record data popularity, we generate random tags based on HE. Specifically, to upload an outsourced file F, client C first uses CE to generate a convergent ciphertext F_c = CE.Enc(k_c, F), where k_c is the convergent key (k_c = CE.KeyGen(F)). Then, C generates a deterministic tag dT = H(F_c), where H(•) is a cryptographic hash function. The purpose of using the hash of the convergent ciphertext as the deterministic tag is to prevent duplicate faking attacks [10], which is explained in detail in Section VI-C. After that, C encrypts dT with HE to obtain a random tag rT = HE.Enc(pk, dT), which can be used for popularity detection and data retrieval.
Popularity Detection (PopDet): After the generation of rT, C sends it to SS for popularity detection. SS can determine the algebraic relation between any two random tags by the interactions with CSP. In consequence, all random tags could form an ordered set and be stored in an AVL tree. After receiving rT, SS first finds the root node rT_1 of the AVL tree T_A (see Fig. 4). Then it computes a tag equality-testing ciphertext etc_t = HE.Sub(rT, rT_1) and sends it to CSP. CSP decrypts etc_t with sk, performs a function I(•) on the decryption result to obtain the tag comparison result rs_t ∈ {−1, 0, 1} (see (1)), and returns rs_t to SS. Note that I(•) is a function that outputs the highest order coefficient of the plaintext polynomial. As described in Section III-D, we set the plaintext modulus as 3. So, we can get the algebraic relation between data tags by computing the highest order coefficient of the subtraction of them.
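The role of I(•) can be illustrated at the plaintext level, i.e., on what CSP would see after decryption. The sketch below is an assumption-laden illustration rather than the scheme's implementation: no HE library is involved, the helper names (encode_binary, poly_sub, leading_coefficient, compare_tags) are hypothetical, and polynomials are plain Python lists whose index i holds the coefficient of x^i.

```python
import hashlib


def encode_binary(tag: bytes):
    """Encode a tag as a polynomial with coefficients in {0, 1}."""
    value = int.from_bytes(tag, "big")
    nbits = 8 * len(tag)
    return [(value >> i) & 1 for i in range(nbits)]


def poly_sub(a, b):
    """Coefficient-wise difference; every coefficient lies in {-1, 0, 1}."""
    return [x - y for x, y in zip(a, b)]


def leading_coefficient(poly):
    """I(.): the highest-order nonzero coefficient, or 0 for the zero polynomial."""
    for c in reversed(poly):
        if c != 0:
            return c
    return 0


def compare_tags(t1: bytes, t2: bytes) -> int:
    """Return 1, 0, or -1 according to whether t1 is greater than, equal to, or less than t2."""
    return leading_coefficient(poly_sub(encode_binary(t1), encode_binary(t2)))
```

Since both operands are binary, the difference needs only the three coefficient values that plaintext modulus 3 can represent, and the sign of the highest differing bit decides the order of the two tags.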
As shown in Fig. 4, SS can use multiple tag equality-testing ciphertexts {etc_t} to interact with CSP and then eventually find the node corresponding to rT. Then, SS can detect the popularity ctr (number of owners) for the data corresponding to the random tag of this node. Note that SS will create a new node for rT if it cannot match any existing node in T_A. The time complexity of the equality-testing for random tags is logarithmic, and the tag storage structure supports efficient node insertion and deletion due to the property of the AVL tree. Besides, neither the tag comparison nor the node update requires client involvement, so we do not need to assume that the clients are online.
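The interaction pattern between SS and CSP during the tree descent can be sketched as follows. Everything here is a mock made up for illustration: MockCiphertext hides nothing and only mimics the interface of HE.Enc and HE.Sub, MockCSP.compare stands in for decryption followed by I(•), the tag is computed directly from the data instead of from a convergent ciphertext, and AVL rebalancing is omitted for brevity.

```python
import hashlib
import secrets


class MockCiphertext:
    """Stand-in for an HE ciphertext; offers no security whatsoever."""

    def __init__(self, value):
        self._value = value
        self.nonce = secrets.token_hex(8)  # mimics the randomness of real ciphertexts


def he_enc(value: int) -> MockCiphertext:
    return MockCiphertext(value)


def he_sub(a: MockCiphertext, b: MockCiphertext) -> MockCiphertext:
    return MockCiphertext(a._value - b._value)


def random_tag(data: bytes) -> MockCiphertext:
    dt = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return he_enc(dt)


class MockCSP:
    """Holds the (mock) secret key and answers comparison queries from SS."""

    def __init__(self):
        self.queries = 0

    def compare(self, etc: MockCiphertext) -> int:
        self.queries += 1
        v = etc._value  # "decryption"
        return (v > 0) - (v < 0)  # comparison result in {-1, 0, 1}


class TagNode:
    def __init__(self, rt):
        self.rt = rt
        self.ctr = 0  # number of owners, updated elsewhere after PoW succeeds
        self.left = None
        self.right = None


class StorageServer:
    def __init__(self, csp):
        self.csp = csp
        self.root = None

    def pop_det(self, rt) -> TagNode:
        """Find the node matching rt, or create one; SS never sees plaintext tags."""
        if self.root is None:
            self.root = TagNode(rt)
            return self.root
        node = self.root
        while True:
            rs = self.csp.compare(he_sub(rt, node.rt))
            if rs == 0:
                return node
            side = "left" if rs < 0 else "right"
            child = getattr(node, side)
            if child is None:
                child = TagNode(rt)
                setattr(node, side, child)
                return child
            node = child
```

Each step of the descent costs one round trip to CSP, so keeping the tree balanced bounds the number of round trips per upload by the tree height.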
Proof of Ownership (PoW): The random tag cannot be considered as the proof of data ownership, since adversaries may learn deterministic tags of user data (especially for predictable data) and the HE public key pk is publicly known. So, we design a PoW protocol to resist both the ownership cheating attack and popularity tamper attack. Note that the PoW protocol needs to be run for both the uploads of popular data and unpopular data.
We assume that client C uploads file F, whose random tag and convergent ciphertext are rT and F_c, respectively. We also assume that this is the first upload of F. After the popularity detection, SS runs PoW with C. Specifically, SS generates two random seeds (c_0, c_1) and sends them to C. Note that the current proof set P_set for rT is empty since this is the first upload. C takes (c_0, c_1) as seeds to generate two random sequences of block indices {c_01, c_02, ..., c_0n} and {c_11, c_12, ..., c_1n} (the length of each index sequence is set to be n).
Using these index sequences, C samples challenge blocks from the block set of F_c (l_F denotes the length of this block set), computes the hashes of the sampled blocks, and encrypts these hashes with HE to obtain two proofs (p_0, p_1). C then sends (p_0, p_1) to SS. The latter inserts {(c_0, p_0), (c_1, p_1)} into the proof set P_set for subsequent uploads and then adds C into the list of owners.
If another client C' uploads a file F' whose random tag is equal to rT, SS will assume that C' may be the second uploader of F. In our PoW, SS sends two seeds (c_u, c_u') to client C', where c_u is selected from the proof set P_set and c_u' is freshly generated. C' generates two proofs (p_u, p_u') based on (c_u, c_u'). Note that these seeds and proofs have different roles. The role of (c_u, p_u) is to allow SS to verify that the file of C' is consistent with previously stored files and that C' has the complete data content, whereas (c_u', p_u') is used to increase the size of P_set. If only a fixed seed and proof are used in each PoW, the compromise of them will allow an adversary without the complete content to pass PoW. If SS randomly generates a seed each time, the verification will be infeasible because SS does not learn the convergent ciphertexts of unpopular data and cannot generate valid proofs for verification. Therefore, SS constructs the proof set P_set and selects a seed from it for verification during each PoW. Increasing the size of P_set decreases the probability that the adversary passes PoW after a certain seed and proof are compromised. The exception is the first uploader C_1. SS uses two freshly generated seeds to run PoW with it since P_set is empty at this point. If C_1 uploads fake proofs, SS can detect that during the PoW for subsequent uploaders.
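The proof-set mechanism can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the scheme: the block size BLOCK and sample count N_IDX are arbitrary, indices are derived from the seed with Python's non-cryptographic random.Random, a proof is the SHA-256 hash over the sampled blocks, and proofs are compared in the clear, whereas in the scheme they are HE ciphertexts that SS compares through CSP. The names make_proof and PowServer are hypothetical.

```python
import hashlib
import random
import secrets

BLOCK = 16  # hypothetical block size in bytes
N_IDX = 4   # hypothetical number of sampled blocks per seed


def split_blocks(fc: bytes):
    return [fc[i:i + BLOCK] for i in range(0, len(fc), BLOCK)]


def make_proof(seed: int, fc: bytes) -> bytes:
    """Client side: hash the blocks of F_c selected by the seed."""
    blocks = split_blocks(fc)
    rng = random.Random(seed)  # stand-in for a cryptographic PRG
    indices = [rng.randrange(len(blocks)) for _ in range(N_IDX)]
    h = hashlib.sha256()
    for i in indices:
        h.update(blocks[i])
    return h.digest()  # the scheme would additionally encrypt this with HE


class PowServer:
    """Server side: keeps the proof set for one random tag."""

    def __init__(self):
        self.proof_set = {}  # seed -> proof
        self.owners = []

    def challenge(self):
        fresh = secrets.randbits(64)
        if not self.proof_set:
            # First uploader: both seeds are freshly generated.
            return (secrets.randbits(64), fresh)
        known = secrets.choice(list(self.proof_set))
        return (known, fresh)

    def respond(self, client_id, seeds, proofs) -> bool:
        # `seeds` must be the pair previously issued by challenge().
        c_u, c_new = seeds
        p_u, p_new = proofs
        first_upload = not self.proof_set
        if not first_upload and self.proof_set[c_u] != p_u:
            return False  # proof for the known seed does not match
        if first_upload:
            self.proof_set[c_u] = p_u
        self.proof_set[c_new] = p_new  # grow the proof set
        self.owners.append(client_id)
        return True
```

The known seed checks consistency with earlier uploads, while the fresh seed enlarges the proof set so that leaking one seed and proof pair helps an adversary only with the probability that this pair is selected again.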
Enc and Dec: After the popularity detection and PoW, SS may require the client to upload encrypted data.Our scheme employs the symmetric encryption algorithm (e.g.AES).If the outsourced data are unpopular, the client performs random encryption to provide semantic security.Otherwise, the client performs CE to encrypt popular data to provide convergent security.Also, the client performs symmetric decryption to restore data.
Ciphertext Verification (CV): The ciphertext verification protocol is designed to prevent a malicious client from uploading correct data tags together with forged convergent ciphertexts to launch duplicate faking attacks. After receiving the random tag $rT$ and the convergent ciphertext $F_c$ from the client, SS computes a random tag $rT' = \mathrm{HE.Enc}(pk, H(F_c))$ and a tag consistency ciphertext $etc_h = \mathrm{HE.Sub}(rT, rT')$, and then sends $etc_h$ to CSP. The latter returns a comparison result $rs_h = I(\mathrm{HE.Dec}(sk, etc_h))$. If $rs_h$ is equal to 0, then $F_c$ is consistent with $rT$, and SS outputs a flag $f = 1$ to indicate that ciphertext verification succeeded. Otherwise, SS outputs $f = 0$ to indicate failure.
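To make the encrypt, subtract, decrypt, and compare pattern concrete, the sketch below uses a toy Paillier cryptosystem as a stand-in for HE. This is a swapped-in technique chosen only because it fits in a few lines; it is not the scheme used in the prototypes, the parameters are far too small to be secure, and all function names are hypothetical.

```python
import hashlib
import math
import secrets

# Toy Paillier parameters built from two Mersenne primes (insecure, illustration only).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)  # secret key material, held by CSP only
MU = pow(PHI, -1, N)


def he_enc(m):
    """Encrypt m (0 <= m < N // 2) with fresh randomness."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) * pow(r, N, N2) % N2


def he_sub(c1, c2):
    """Homomorphic subtraction: the result encrypts (m1 - m2) mod N."""
    return c1 * pow(c2, -1, N2) % N2


def he_dec(c):
    """Decrypt with the secret key."""
    return (pow(c, PHI, N2) - 1) // N * MU % N


def sign_indicator(x):
    """I(x): map a decrypted difference to -1, 0, or 1."""
    if x == 0:
        return 0
    return 1 if x < N // 2 else -1


def random_tag(f_c):
    """Random tag: encryption of a hash of the convergent ciphertext."""
    digest = int.from_bytes(hashlib.sha256(f_c).digest(), "big") % (N // 2)
    return he_enc(digest)


def ciphertext_verification(r_t, f_c):
    r_t_prime = random_tag(f_c)            # computed by SS
    etc_h = he_sub(r_t, r_t_prime)         # computed by SS
    rs_h = sign_indicator(he_dec(etc_h))   # computed by CSP
    return 1 if rs_h == 0 else 0
```

Two tags of the same ciphertext differ as bit strings because each encryption uses fresh randomness, yet their homomorphic difference decrypts to zero, which is what allows SS to test consistency without ever holding the secret key.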
Integrity Verification (IV): After the client downloads and restores its outsourced file, denoted $F'$, it runs the integrity verification algorithm to check data integrity. It computes a convergent key $k'_c = \mathrm{CE.KeyGen}(F')$ and then checks whether $k'_c$ is equal to the original convergent key $k_c$. If so, it outputs a flag $f = 1$. Otherwise, it outputs $f = 0$.
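A minimal sketch of this check is given below, assuming that the convergent key is the SHA-256 digest of the file, which matches the key generation function named in the evaluation section. The function names are hypothetical.

```python
import hashlib
import hmac


def ce_keygen(data):
    """Convergent key derived from the file content (SHA-256 assumed)."""
    return hashlib.sha256(data).digest()


def integrity_verify(restored, k_c):
    """Return the flag f: 1 if the restored file matches the stored key, else 0."""
    return 1 if hmac.compare_digest(ce_keygen(restored), k_c) else 0
```

The client only needs to keep the convergent key locally. Any modification of the restored file changes the recomputed key unless a hash collision is found.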

G. The Workflow of the Scheme
The workflow consists of the system setup, data upload, and data download.The specific processes are as follows.
System Setup: CSP runs KG(λ) to generate a system-level HE key pair (pk, sk), sends the public key pk to SS, and keeps the secret key sk private. Once a client joins the system, it can obtain pk after authenticating with CSP. SS sets a popularity threshold t for all user data. It is feasible to set different thresholds for different data. Here, we use a uniform threshold t for simplicity. We further discuss this issue in Section V-F.
Data Upload: As shown in Fig. 5, to upload a file F, client C first runs RTG(pk, F) to get a random tag rT, and then sends rT to SS. SS performs the popularity detection after receiving rT. Specifically, SS inputs rT and the AVL tree T A to run PopDet with CSP. After the popularity detection protocol, SS can find the corresponding node of rT in T A after multiple equality tests. The storage structure of SS is shown in Fig. 6. Each node corresponds to a representative random tag rT i and a set of random tags that are equal to rT i after the HE decryption. The representative tag is used as an index, and it could be any element of the set. SS checks the current owner number of the node corresponding to rT to detect data popularity. For example, if rT is equal to rT 2 after the HE decryption, then its popularity is 101.
After the popularity detection, SS runs PoW with C. SS increases the owner number by one and adds C into the list of owners only if C passes PoW.If the number of owners for F has not reached the threshold t, it means that F is still unpopular, then SS will set an uploading response urs to up.If the number of owners just reaches t, urs is set to pc.Otherwise, urs is set to p. SS returns the uploading response urs to C.
If urs is equal to up, which means that F is unpopular, C needs to upload a random ciphertext of F .C generates a random key k r ← {0, 1} λ and computes a random ciphertext F r = SE.Enc(k r , F ), where SE.Enc(•) is a symmetric encryption (SE) algorithm.C uploads F r and stores (rT, k c , k r ) locally for downloading and restoring F .SS stores rT and F r for C. SS only deduplicates popular data, while unpopular data are randomly encrypted and cannot be deduplicated.As shown in Fig. 6, if the file corresponding to a random tag is unpopular (e.g.rT 1 and rT n ), SS will store all random tags that are equivalent to that random tag along with their corresponding random ciphertexts.If the file is popular, SS only stores one copy of the convergent ciphertext (e.g.F c 2 ).
If urs is equal to pc, the popularity conversion needs to be performed, and C needs to upload a convergent ciphertext F c. After receiving F c, SS inputs (pk, rT, F c) to run the ciphertext verification protocol with CSP to resist duplicate faking attacks. When the protocol concludes, if SS outputs f = 1, the verification has succeeded, and all random ciphertexts corresponding to rT are removed to save storage space. Otherwise, C may have uploaded a fake ciphertext to launch a duplicate faking attack, so SS aborts the popularity conversion process.
If urs is equal to p, F is already popular. C does not need to upload a complete encrypted file, and SS only needs to add C to the owner list.
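The three cases of the uploading response can be summarized in a few lines. The function below is an illustrative sketch of the decision made by SS after PoW; the function name and the return convention are assumptions.

```python
def upload_response(owners_before, passed_pow, t):
    """Return (new owner count, urs) for one upload attempt.

    urs is "up" (unpopular), "pc" (popularity conversion), "p" (popular),
    or None when the client fails PoW and is not counted as an owner.
    """
    if not passed_pow:
        return owners_before, None
    owners = owners_before + 1
    if owners < t:
        return owners, "up"   # still unpopular: upload a random ciphertext
    if owners == t:
        return owners, "pc"   # threshold just reached: upload the convergent ciphertext
    return owners, "p"        # already popular: no data upload needed
```

With t = 3, for example, the first two owners receive "up", the third receives "pc", and every later owner receives "p".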

V. THE MULTI-TENANT POPULARITY-BASED SECURE DEDUPLICATION SCHEME
In this section, we first point out the limitations of the scheme in Section IV. Then, we propose a multi-tenant popularity-based secure deduplication scheme to address these limitations. Hereinafter, we call the scheme in Section IV the single-tenant scheme and call the scheme in this section the multi-tenant scheme.

A. Limitations of the Single-Tenant Scheme
The single-tenant scheme has the following limitations.The first limitation is that the centralized CSP may become an efficiency bottleneck.As the number of users increases, the computation overhead of CSP increases significantly, which makes the system hard to scale up.A straightforward solution is to deploy multiple CSP s with an identical system-wide shared HE key pair.Nevertheless, this key pair will become a single point of failure.If an adversary breaks any one of CSP s and gets the key pair, the semantic security of all unpopular data will be compromised, which is a serious vulnerability.The second limitation is that the single-tenant scheme cannot provide the key rotation.Once the HE key pair is leaked, all tags need to be regenerated, which is inconvenient and computationally expensive.

B. Overview of the Multi-Tenant Scheme
We introduce MKH-PRE into our single-tenant scheme to design a multi-tenant scheme, which achieves both scalability and updatability.The architecture of the multi-tenant scheme is shown in Fig. 7. Users are divided into different tenants.Each tenant deploys a CSP for managing the HE key pair.To avoid the single point of failure, different CSP s hold distinct HE key pairs.Based on the property of MKH-PRE, SS can perform the equality-testing on cross-tenant random tags by interacting with multiple CSP s.Furthermore, the multi-tenant scheme also supports key rotation since MKH-PRE has the proxy re-encryption algorithm.The threat model and security goal of the multi-tenant scheme are similar to the single-tenant scheme (see Sections IV-C and IV-D).The only difference is that the multi-tenant scheme adds a security goal of forward security.That is, even if an adversary learns original system-level secret keys and the updated outsourced data, it cannot restore any data information.

C. Definition of the Multi-Tenant Scheme
The multi-tenant scheme consists of (KG, MRTG, CPopDet, PoW, Enc, Dec, CV, IV, RKGen, Update).We define MRTG, CPopDet, RKGen, and Update as follows, while the others are the same as the single-tenant scheme.
MRTG (pk, m) → (rT, ID S ): The multi-tenant random tag generation algorithm takes the public key pk and a message m ∈ M as input, and outputs a random tag rT and an id set ID S of rT .
CPopDet (SS(rT, ID S , T A ), {CSP i (sk i )}) → (ctr, {ε}): The cross-tenant popularity detection protocol is run by SS and multiple CSP s.SS inputs a random tag rT , its id set ID S , and an AVL tree T A while each CSP i inputs its secret key sk i .When the protocol concludes, SS outputs the data popularity ctr, while CSP s output nothing.
RKGen $(sk, sk') \to rk$: The re-encryption key generation algorithm takes an original secret key $sk$ and a new secret key $sk'$ of MKH-PRE as input, and outputs a re-encryption key $rk$.
Update $(rk_i, \{rT_i\}) \to \{rT'_i\}$: The update algorithm takes a re-encryption key $rk_i$ and a tag set $\{rT_i\}$ as input, and outputs an updated tag set $\{rT'_i\}$.

D. Construction of the Multi-Tenant Scheme
Here, we present the constructions of the algorithms and protocols in the multi-tenant scheme.
Key Generation (KG): All CSP s run MP.KeyGen(λ) to generate key pairs.
Multi-Tenant Random Tag Generation (MRTG): We take client C i in tenant T i as an example for illustration.C i first computes a deterministic tag dT , which is the same as the single-tenant scheme.Then, it uses the public key pk i of T i and dT to generate a random tag rT = MP.Enc(pk i , dT ), whose id set is {i}.
Cross-Tenant Popularity Detection (CPopDet): After receiving rT and {i}, SS first finds the root node rT 1 of the AVL tree T A (see Fig. 8). Unlike in the single-tenant scheme, the id set {j} of rT 1 may not be equal to that of rT. If the two id sets are the same, then the process of popularity detection is the same as in the single-tenant scheme. Otherwise, SS computes a tag equality-testing ciphertext etc t = MP.Sub(rT 1 , rT, {pk i , pk j }), extracts (etc T i , etc T j ) from etc t , and then sends them to CSP i and CSP j respectively. Note that the id set of etc t is {i, j}. CSP i and CSP j respectively run PartDec(etc T i , sk i ) and PartDec(etc T j , sk j ) to get decryption shares ρ i and ρ j , and return them to SS. The latter invokes I(FinDec(ρ i , ρ j )) to restore the final decryption result rs t ∈ {−1, 0, 1}. SS can then determine the order relation between rT 1 and rT based on rs t .
As shown in Fig. 8, SS repeats the above steps to find the corresponding node of rT in T A , and then obtains the cross-tenant data popularity ctr. If rT does not match any existing node, SS creates a new node for it. In this case, ctr is set to 1 if the client later passes PoW. The cross-tenant popularity detection is efficient due to the distributed decryption, as demonstrated in Section VII-C.
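The search procedure can be illustrated with a mock comparison oracle that returns a value in {−1, 0, 1}, standing in for the homomorphic subtraction followed by decryption at the CSPs. The sketch below is a simplification under stated assumptions: it keeps representative tags in a sorted list rather than an AVL tree, it increments the owner count without running PoW, and the class names `Tag`, `Oracle`, and `TagIndex` are hypothetical.

```python
import secrets


class Tag:
    """Stand-in for a random tag: a hidden value plus fresh randomness."""

    def __init__(self, hidden):
        self._hidden = hidden                # only the oracle may look at this
        self.nonce = secrets.token_hex(8)    # makes equal tags look different


class Oracle:
    """Mock of the CSP side: three-way comparison of two hidden values."""

    def __init__(self):
        self.calls = 0

    def compare(self, a, b):
        self.calls += 1
        return (a._hidden > b._hidden) - (a._hidden < b._hidden)


class TagIndex:
    """Sorted list of [representative tag, owner count] entries held by SS."""

    def __init__(self, oracle):
        self.oracle = oracle
        self.nodes = []

    def locate(self, tag):
        """Binary search driven only by oracle answers; returns (position, found)."""
        lo, hi = 0, len(self.nodes)
        while lo < hi:
            mid = (lo + hi) // 2
            rs = self.oracle.compare(tag, self.nodes[mid][0])
            if rs == 0:
                return mid, True
            if rs < 0:
                hi = mid
            else:
                lo = mid + 1
        return lo, False

    def record_upload(self, tag):
        """Find or create the node for this tag and return its popularity."""
        pos, found = self.locate(tag)
        if not found:
            self.nodes.insert(pos, [tag, 0])
        self.nodes[pos][1] += 1
        return self.nodes[pos][1]
```

With 1,000 stored tags, a lookup needs at most 10 oracle calls, compared with up to 1,000 for a linear scan. An AVL tree gives the same logarithmic number of comparisons while also keeping insertion logarithmic, which the sorted list used here does not.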
Re-Encryption Key Generation (RKGen): $CSP_i$ of tenant $T_i$ inputs an original secret key $sk_i$ and a new secret key $sk'_i$ to run $\mathrm{MP.RKGen}(sk_i, sk'_i)$, and outputs a re-encryption key $rk_i$.
Update: After receiving a re-encryption key $rk_i$ from $CSP_i$, SS first finds all random tags whose id set is $\{i\}$ to form a tag set $\{rT_i\}$. Then, SS runs $\mathrm{MP.ReEnc}(rk_i, rT_i)$ on every element of $\{rT_i\}$ and collects the updated tags into an updated tag set $\{rT'_i\}$.
The constructions of other algorithms and protocols are the same as the single-tenant scheme.

E. The Workflow of the Multi-Tenant Scheme
The workflow consists of the system setup, data upload/download, and key update.The specific processes are as follows.
System Setup: All CSP s run KG(λ) to generate key pairs and send their public keys to SS.The key pair of CSP i is (pk i , sk i ).When a client C i of tenant T i joins the system, it can obtain the public key pk i after authenticating with CSP i .SS sets a uniform popularity threshold t for all user data from different tenants.The setting of t will be further discussed in Section V-F.
Data Upload/Download: When a client C i of tenant T i outsources its file F to SS, it runs MRTG(pk i , F) to generate a random tag rT and its id set {i}, and then sends (rT, {i}) to SS. The latter runs CPopDet with multiple CSPs to get the popularity ctr i of rT and then runs PoW (the same as in the single-tenant scheme) with C i . If C i passes PoW, SS increases the data popularity by one and adds C i to the list of owners. The following steps are the same as in the single-tenant scheme. Specifically, SS returns an uploading response urs ∈ {up, pc, p} according to the current data popularity, while C i uploads data in different ways according to urs (see Section IV-G). The steps for data download are the same as in the single-tenant scheme.
Key Update: If HE secret keys are compromised by adversaries, the security of unpopular data is reduced to convergent security. Hence, CSPs rotate HE key pairs as a regular routine to protect against key compromise. In the multi-tenant scheme, CSPs of different tenants can update their keys independently. If $CSP_i$ performs the key update, it invokes KG(λ) to generate a new key pair $(pk'_i, sk'_i)$. Then, it uses the original secret key $sk_i$ and the new secret key $sk'_i$ to invoke $\mathrm{RKGen}(sk_i, sk'_i)$ to get a re-encryption key $rk_i$, and sends $rk_i$ to SS. The latter runs Update to get an updated tag set $\{rT'_i\}$. After the key update, the updated tags are protected by the new key pair $(pk'_i, sk'_i)$, while the original key pair $(pk_i, sk_i)$ is invalid.

F. Discussion
Here, we discuss some issues of our schemes.
Security Protection: Compared to the state-of-the-art popularity-based encrypted deduplication schemes [3], [4], our schemes need a weaker security assumption and provide stronger security protection. First, our schemes deploy a semi-trusted third party (CSP) rather than a fully trusted party. Second, our schemes perform PoW to resist the popularity tamper attack, while existing schemes are vulnerable to it. Besides, we shrink the size of the data that needs to be securely stored. Recall that existing schemes need to maintain confidentiality for all deterministic tags, while our schemes only need to protect HE secret keys. This feature reduces the amount of sensitive data that needs to be protected and thereby the risk of sensitive data leakage.
Popularity Threshold: We use a uniform threshold t to classify data as popular or unpopular for simplicity.Actually, SS can set various thresholds based on real scenarios.Note that the number of thresholds needs to be controlled to prevent uncontrollable boom [3].After SS publishes candidate thresholds, users can choose a preferred threshold for their outsourced data.During the data upload, the client needs to state the threshold it chooses for outsourced data.The popularity of the same file with distinct thresholds needs to be recorded separately [3].Otherwise, security will be easily compromised.
User Revocation: For cloud storage systems, user revocation is an important issue.In our schemes, to revoke an owner U r of an unpopular file F r , SS removes U r from the owner list, decreases the owner number of F r by one, and then deletes the random tags and random ciphertexts uploaded by U r .Since unpopular data are encrypted by random keys, a revoked user cannot restore any data information even if it intercepts the unpopular data uploaded by other users.To efficiently revoke an owner of a popular file, we can tweak our schemes by using the encryption schemes in REED [33] to encrypt popular data.Specifically, when uploading popular data, clients use convergent all-or-nothing transform (CAONT [35]) to transform data into trimmed packages and stubs.Then the trimmed packages are encrypted with deterministic encryption algorithms and are used to be deduplicated to save storage space, while the stubs are encrypted with random keys and are used to perform the efficient key update and user revocation.As a result, our scheme can revoke users' access rights by re-encrypting the stubs of popular data.The user revocation is efficient since the size of the stub is small (0.78% of the original data [33]), and the security can be reduced to the property of CAONT.

VI. SECURITY ANALYSIS
In this section, we analyze the security of our schemes under our threat model (see Section IV-C).

A. Data Confidentiality
Both single-tenant and multi-tenant schemes encrypt popular data with CE, so they are provided with convergent security, as analyzed in [10].As a result, our security analysis focuses on the semantic security of unpopular data.We take the single-tenant scheme as an example to analyze the security for unpopular data, while the analysis for the multi-tenant scheme is similar.The specific security model is defined by a security game G played between adversary A and challenger C below, which respectively model an honest-but-curious SS and an honest client.
Setup: C runs KG(λ) to generate a HE key pair (pk, sk) and sends pk to A.
Query: A adaptively issues queries on selected messages. For a queried message m i , C generates a random symmetric key k i , invokes SE.Enc(k i , m i ) to generate a symmetric ciphertext C m i , and then returns C m i to A. Note that C generates a fresh symmetric key for each query.
We define the advantage $\mathrm{Adv}_{\mathcal{A}}$ of A in the above game G as
$$\mathrm{Adv}_{\mathcal{A}} = \left| \Pr[b' = b] - \frac{1}{2} \right|,$$
where $b$ is the challenge bit, $b'$ is the guess output by A, and the probability is over the random bits used by the challenger and the adversary.
Definition 1. Our scheme provides semantic security for unpopular data only if for any probabilistic polynomial-time (PPT) adversary A, there exists a negligible function negl(λ) such that
$$\mathrm{Adv}_{\mathcal{A}} \le \mathrm{negl}(\lambda).$$
Theorem 1.If SE and HE are semantically secure, then our scheme provides semantic security for unpopular data.
Proof. We prove the theorem by defining a sequence of indistinguishable security games, each differing slightly from the previous one. In G 2 , the information available to A is (rT r , C r , P r ), all of which are uniformly random values. Therefore, in G 2 , A can only output its guess b' at random, so its advantage Adv A is negligible. Since G 2 and G 0 are indistinguishable, the advantage of A in G 0 is also negligible.
The Security of the Multi-Tenant Scheme: The security analysis for the multi-tenant scheme is similar to that of the single-tenant scheme. The only difference is that the semantic security of unpopular data is reduced to that of MKH-PRE. If a PPT adversary A compromises the semantic security of unpopular data, then we can construct another adversary A' that breaks the semantic security of MKH-PRE or SE.
Remarks: Note that A could learn the owner number of a file corresponding to a random tag (see Fig. 6).But this information does not affect the semantic/convergent security of unpopular/popular data, since all outsourced data are semantically/convergently encrypted.

B. Attack Resistance
We analyze the resistance to popularity tamper attacks (PTA), duplicate faking attacks (DFA), and brute-force attacks (BFA) for our schemes.Note that the resistance to PTA is the same as the resistance to ownership cheating attacks.The analysis for these attacks in the single-tenant and multi-tenant schemes is the same, so we do not differentiate them.
Definition 2. Let $\ell$ be a security parameter. If the probability $P_{pass}$ that a malicious client passes PoW satisfies $P_{pass} \le 2^{-\ell}$, then our schemes achieve resistance to PTA.
Theorem 2. Assuming that there are no collisions in the hash function and that the malicious client owns a certain portion $p$ of a target file $F$, our schemes achieve resistance to PTA if the number $n$ of challenge blocks in PoW satisfies
$$n \ge -\frac{\ell \cdot \ln 2}{\ln p}.$$
Proof. Assume that the malicious client $A_e$ knows a portion $p$ of $F$. In other words, given any block $B$ of $F$, the probability that $A_e$ owns $B$ is $p$. Hence, for any challenge block, the probability that $A_e$ owns it is $p$. If $A_e$ has the challenge block, it can use its hash to compute the proof of ownership. Otherwise, it can only guess the hash of the challenge block. Suppose that the length of the hash is $K$ bits. Then the probability that $A_e$ guesses the correct hash of a challenge block is $2^{-K}$. So, the probability that $A_e$ passes PoW is
$$P_{pass} = \left(p + (1 - p) \cdot 2^{-K}\right)^n,$$
where the term $(1 - p) \cdot 2^{-K}$ can be considered a negligible value. Therefore, we have $P_{pass} \approx p^n$. To resist PTA, we need to ensure that $P_{pass} \le 2^{-\ell}$, that is, $n \ln p \le -\ell \ln 2$. Since $\ln p < 0$, dividing by $\ln p$ reverses the inequality, so the lower bound on $n$ is $-\frac{\ell \cdot \ln 2}{\ln p}$.
Theorem 3. Assuming that there are no collisions in the hash function, our schemes achieve resistance to DFA.
Proof. For a malicious client $A_e$ to launch DFA successfully, SS must output $f = 1$ during the ciphertext verification. In other words, its forged convergent ciphertext $F'_c$ ($F'_c \ne F_c$) needs to satisfy $H(F'_c) = H(F_c)$, where $F_c$ is an honestly generated ciphertext. In this case, $A_e$ has found a collision in the hash function, which contradicts the assumption.
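The bound in Theorem 2 is easy to evaluate numerically. The helper below is an illustrative sketch: the function name and the argument `sec_bits`, which plays the role of the security parameter in Definition 2, are assumptions.

```python
import math


def min_challenge_blocks(sec_bits, p):
    """Smallest integer n with p**n <= 2**(-sec_bits), for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.ceil(-sec_bits * math.log(2) / math.log(p))
```

For instance, an adversary that already holds 90% of the file (p = 0.9) must be challenged on 422 blocks to push its success probability below 2^-64, whereas far fewer blocks suffice for smaller p.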
Theorem 4. If both SE and HE are semantically secure, then our schemes could resist BFA for unpopular data and unpredictable popular data.
Proof.All tags, proofs, and communication transcripts between SS and CSP are encrypted by HE.Based on the semantic security of HE, any adversary can not launch BFA on them to restore data information.Unpopular data are encrypted with SE, so the BFA on them can not work if SE is semantically secure.Popular data are encrypted with CE, so their semantic security can be achieved if they are unpredictable.If the popular data are predictable, then they are vulnerable to BFA [2].

C. Data Integrity
The honest-but-curious SS can access all outsourced data.However, based on the HBC assumption, it does not maliciously modify, delete, or destruct user data.This is a reasonable assumption since SS needs to maintain its industry reputation and accountability.For a malicious client A e , according to Theorem 3, it can not successfully launch DFA to compromise data integrity without being detected.For legitimate users, they can use random keys or convergent keys to restore outsourced data and then verify the data integrity via integrity verification.Note that the key generation of CE is a hash function.If the data integrity is compromised, and the integrity verification can be passed, then the adversary finds collisions in the hash function, which breaks the assumption.

D. Tag Consistency
Based on the collision resistance of the hash function and the decryption correctness of HE, our data tags achieve validity. Assume that a malicious client A e constructs the same tag rT for two distinct files F 1 and F 2 . Based on the decryption correctness of HE, we have H(F 1 c ) = H(F 2 c ), where F 1 c and F 2 c are the convergent ciphertexts of F 1 and F 2 respectively. Since F 1 ≠ F 2 , the probability that F 1 c = F 2 c is negligible. Therefore, A e has found a collision in the hash function, which contradicts the assumption. Besides, since random tags are encrypted by HE, they are indistinguishable from random bit strings of equal length based on the semantic security of HE, which shows that our schemes provide security for tags.

E. Forward Security
The forward security of the multi-tenant scheme can be easily drawn from the security of the proxy re-encryption in MKH-PRE.Based on the security of the proxy re-encryption, the original keys are useless after the key rotation.The original keys and new keys are both uniformly and randomly chosen from the key space, and the updated tags are consistent with the new keys and are indistinguishable from random bit strings based on the security of MKH-PRE.

F. Security Comparison
We compare the security properties of our schemes with state-of-the-art schemes [3], [4]. As shown in Table I, existing schemes cannot provide resistance to PTA, since they do not introduce the PoW protocol. Compared to the single-tenant scheme, the multi-tenant scheme further provides scalability and updatability based on the properties of MKH-PRE. Note that the confidentiality in Table I refers to the semantic security of unpopular data and the convergent security of popular data.

VII. EVALUATION
We implement prototypes of our schemes, which consist of three entities: the client C, the storage server SS, and the crypto-service provider CSP .We use MD5 as the tag generation function for deterministic data tags and SHA-256 as the key generation function for convergent keys.The symmetric encryption algorithm used in our prototypes is AES-128-CTR.These cryptographic primitives are all implemented based on the OpenSSL library [36].The codes of HE and MKH-PRE are implemented in different ways.HE can be implemented more efficiently than MKH-PRE, since it does not need to support the proxy re-encryption or homomorphic computation on ciphertexts under different key pairs.The code of HE is implemented based on the open-source library SEAL [37], [38].The MKH-PRE is implemented based on the schemes of [39] and [40], and the code of MKH-PRE is implemented based on FNTRU [41].All entities in our prototypes are implemented in C++.
We also implement the prototypes of existing schemes [3], [7]. We call the scheme proposed by Stanek et al. [3] Sdedup for short. We instantiate the bilinear map in the prototypes of [3] and [7] with the type-F pairing provided by the PBC library [42]. Our experiments run on a machine equipped with a quad-core 2.7 GHz Intel Core i7-7500U, a 5400 RPM SATA hard disk, and 8 GB RAM, and installed with 64-bit Ubuntu 20.04.12. We first evaluate the time overheads of data upload in three scenarios: unpopular data upload, popularity conversion, and popular data upload. Then, we evaluate the overheads of the tag equality-testing and key update to demonstrate their efficiency. Finally, we use a real-world dataset to evaluate the storage overhead.

A. Comparison With Sdedup
One of the differences between our schemes and Sdedup is that we use HE/MKH-PRE to generate random tags and introduce a PoW protocol, while Sdedup uses the convergent threshold cryptosystem.Thus, we compare the performance of the cryptography tools in our schemes and Sdedup.The performance comparison is shown in Fig. 9.
We select a test file F t and analyze the performance of the various cryptography tools by evaluating the time overheads of generating the random tag and the proofs of ownership. Note that these overheads depend only on the file hash, not the file size, so we do not specify the size of F t . The PoW protocols in the single-tenant and multi-tenant schemes are denoted by PoW-S and PoW-M in Fig. 9 respectively, since different homomorphic encryption algorithms are used. The time overhead of the threshold cryptosystem in Sdedup is 31.1 ms, while the counterparts of HE/MKH-PRE and PoW-S/PoW-M are 0.4/3.3 ms and 2.8/7.9 ms respectively. Therefore, compared with Sdedup, the cryptography tools used in both the single-tenant and multi-tenant schemes are more efficient. Besides, HE encryption is significantly faster than MKH-PRE encryption.
Then, we compare the time overheads for uploading unpopular data among Sdedup, our single-tenant and multi-tenant schemes with multiple test files ranging in size from 1 MiB  to 500 MiB.The result is shown in Fig. 10.The time overheads include the overheads of random tag generation, data encryption, PoW (only in our schemes), and data communication.When uploading a file of 1 MiB, our single-tenant and multi-tenant schemes can respectively reduce 51.9% and 17.4% time overheads compared with Sdedup.When the file size becomes 500 MiB, these two ratios become 1.0% and 0.9%.We can find that the time overheads for uploading unpopular files with relatively large sizes are close among these three schemes, although our schemes seem to have slightly less overhead.The reason is that the overheads of computing file hashes and data encryption account for about 90% of the total, and these two parts in Sdedup and our schemes are similar.The processes of popularity conversion and uploading popular data in Sdedup and our schemes are basically the same, so the time overheads of these two processes are also close.

B. Performance Comparison Under Three Scenarios
We take the single-tenant scheme as an example to evaluate the time overhead of each part in three scenarios.The situation of the multi-tenant scheme is similar.The size of the test file is 10 MiB.We assume that SS has already stored 10,000 random tags.There are six main components in the process of data upload: random tag generation, PoW, popularity detection, ciphertext verification, data encryption (the random encryption for unpopular data), and communication.
As shown in Fig. 11, the overheads of random tag generation account for 60.2%, 64.4%, and 95.2% of the total overheads in these three scenarios respectively, making it the most time-consuming component. The overheads of the random tag generation include the overheads of generating the convergent ciphertext, deterministic tag, and random tag. The overheads of PoW and popularity detection in these scenarios only account for 1.8% ∼ 2.6% and 0.3% ∼ 0.6% respectively, so they have little impact on performance. Compared with the unpopular data upload, the popularity conversion incurs the overhead of ciphertext verification instead of data encryption. The overheads of the ciphertext verification include the overheads of generating the deterministic tag, one HE encryption, and one HE decryption, which are close to the overheads of the data encryption. So, these two scenarios have similar overheads. The popular data upload has the lowest overhead since it requires neither data encryption nor ciphertext verification. The communication overhead in popular data upload is also very low since the whole data do not need to be uploaded. As a result, the overhead for the popular data upload is reduced by 39.2% and 35.8% compared with the unpopular data upload and popularity conversion respectively.
We use multiple test files to evaluate the time overheads for data upload in three scenarios.In Fig. 12, we can see that the overhead for uploading data will become lower once the data become popular.When the size of the outsourced data is 500 MiB, uploading popular data can save 34.6% and 32.2% of the time overhead compared with uploading unpopular data and popularity conversion.

C. Performance of Tag Equality-Testing
We compare the time overheads of the linear and binary search for random tags.The result is shown in Fig. 13.The performance evaluation of linear and binary searches is under the single-tenant scheme.We insert 1,000 random tags continuously into the linear structure and the AVL tree.The binary search based on the AVL tree is more efficient.The insertion of the 1,000th random tag takes only 15 ms using binary search, while the delay of the linear search is 99 ms.
The overheads of inserting 1,000 random tags continuously in our single-tenant/multi-tenant schemes and the static/dynamic schemes in [7] are shown in Fig. 14. We can find that the single-tenant and multi-tenant schemes respectively have the lowest and highest overheads for the tag equality-testing, while the static and dynamic schemes in [7] are in the middle.
The use of MKH-PRE mildly increases the overhead of tag equality-testing. But continuously inserting 1,000 random tags into the AVL tree in the multi-tenant scheme only needs 71 ms, which is still very efficient. The distributed decryption in MKH-PRE improves the performance of the tag equality-testing in the multi-tenant scheme. The decryption processes of multiple CSPs are performed in parallel, which ensures that the multi-key decryption does not bring significant time overhead. Note that the fluctuation in Fig. 14 is mainly due to the instability of the decryption time and the adjustment of the tree height when inserting nodes. Compared with [7], the advantage of our schemes is that our tag equality-testing does not need the involvement of clients and does not require the assumption that clients are online when the tag equality-testing is performed. During the evaluation of the static/dynamic schemes in [7], we set up a client that is always online.
We also evaluate the overhead of each part in the tag equality-testing. Specifically, we evaluate the overheads of inserting 10, 100, 200, 300, 400, and 500 tags, respectively. Fig. 15 shows the evaluation result in the single-tenant scheme. The overheads of the tag equality-testing include the overheads of the subtraction of multiple HE ciphertexts (denoted as HE.Sub), the HE decryption (denoted as HE.Dec), the adjustment of the AVL tree, and the communication overhead. The overheads for the HE decryption and communication dominate, accounting for more than 90% of the total. Fig. 16 shows the evaluation result in the multi-tenant scheme. The overhead for the HE decryption (MP.Dec) accounts for more than 94% of the total. The reason is that the homomorphic decryption in MKH-PRE is significantly slower than in HE. Besides, we can find that the communication overhead does degrade the performance in the single-tenant scheme, but in the multi-tenant scheme, the performance bottleneck of the tag equality-testing is the homomorphic decryption.

D. Performance of Key Update
The performance of the key update in the multi-tenant scheme is shown in Fig. 17. The key update is efficient: updating 500 random tags takes less than 2.6 s. The reason is that our key update is transparent to users, and the re-encryption process consists of only two polynomial multiplications. The re-encryption is implemented based on NTRU [28], [40], in which the polynomial coefficients are relatively small, and the polynomial multiplications can be efficiently performed using the Fast Fourier Transform (FFT) [43].
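As a rough illustration of FFT-based polynomial multiplication, the sketch below multiplies two polynomials in the ring Z_q[x]/(x^N - 1) with N a power of two. The choice of ring, the modulus q, and the function names are assumptions made for this example; the NTRU variant in [28], [40] may use a different ring, and a production implementation would more likely use an exact number-theoretic transform than floating-point arithmetic.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def cyclic_mul(f, g, q):
    """Multiply f and g in Z_q[x]/(x^N - 1) using the FFT.

    Rounding is safe here only because the coefficients are small.
    """
    n = len(f)
    fa = fft([complex(c) for c in f])
    ga = fft([complex(c) for c in g])
    prod = fft([x * y for x, y in zip(fa, ga)], invert=True)
    return [round(c.real / n) % q for c in prod]


def cyclic_mul_naive(f, g, q):
    """Schoolbook O(N^2) reference for checking the FFT result."""
    n = len(f)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + f[i] * g[j]) % q
    return out
```

The FFT version needs O(N log N) operations per multiplication instead of O(N^2), which is consistent with the low key-update cost reported above.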

E. Storage Efficiency
We evaluate the storage efficiency of our schemes based on a real-world dataset: the PB dataset [44], which contains the metadata of 679,515 unique torrents from The Pirate Bay, collected on December 5th, 2008. The dataset does not provide the granularity of torrent contents, so we consider each torrent to correspond to one file for measurement. As analyzed in [3], this simplification does not positively impact the evaluation results. We take the sum of the number of "seeders" (peers already having the whole file) and "leechers" (peers having only part of the file, but intending to get the whole file in the future) of a file as its popularity. After removing the files with zero size and zero popularity, we obtain a dataset consisting of 442,332 unique files. We randomly select 200, 400, 600, 800, and 1000 files from the dataset to evaluate storage efficiency. The storage overheads of our single-tenant and multi-tenant schemes are almost the same, so we do not differentiate them. We compare our schemes with a baseline scheme, non-dedup, which does not deduplicate any duplicate files. Besides, we set the popularity threshold t to 50, 100, and 200 to evaluate its impact on storage efficiency.

Fig. 18 shows the evaluation result for storage overhead. As the number of selected files increases, our schemes save more storage compared with non-dedup. When 200 files are stored, the storage overhead is 3674 MB for non-dedup, and 1424 MB (t = 50), 1915 MB (t = 100), and 2364 MB (t = 200) for our schemes, which corresponds to storage savings of 35.7%-61.2%. When 1000 files are stored, the storage savings increase to 41.9%-66.5%. Moreover, a smaller threshold t results in higher storage efficiency. This is because a smaller t leads to more popular files, and our schemes save storage by deduplicating popular files.
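The savings for the 200-file case can be reproduced from the reported overheads. The short sketch below uses only the figures quoted above; the function and variable names are illustrative.

```python
def storage_saving(baseline_mb, dedup_mb):
    """Fraction of storage saved relative to the non-dedup baseline."""
    return 1.0 - dedup_mb / baseline_mb


baseline_mb = 3674                              # non-dedup, 200 files
overhead_mb = {50: 1424, 100: 1915, 200: 2364}  # threshold t -> overhead in MB

savings_percent = {
    t: round(100 * storage_saving(baseline_mb, mb), 1)
    for t, mb in overhead_mb.items()
}
print(savings_percent)  # {50: 61.2, 100: 47.9, 200: 35.7}
```

The extremes, t = 200 and t = 50, give the reported range of 35.7%-61.2%.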

VIII. RELATED WORK
After MLE [10] and DupLESS [2] were proposed, researchers presented many novel encrypted deduplication schemes. Liu et al. [11], [12] and Yu et al. [13] design encrypted deduplication schemes without additional independent servers. Besides, Shin et al. [45] propose decentralized server-aided encryption for secure deduplication to alleviate the single point of failure in DupLESS. Li et al. [46], [47] propose defense schemes against frequency analysis attacks in encrypted deduplication systems. Zhao et al. [48] propose the updatable MLE (UMLE), which allows the efficient update of the encrypted files stored in the cloud server. SGXDedup [49] is proposed to speed up encrypted deduplication via Intel SGX. Yang et al. [50] and Xu et al. [51] propose access control schemes for encrypted deduplication. Yu et al. [52] and Zhang et al. [53] propose encrypted deduplication schemes against side-channel attacks. R-MLE2 [8] and μR-MLE2 [7] are proposed to provide lock-dependent security for secure deduplication. None of the above schemes considers data popularity. The state-of-the-art popularity-based secure deduplication schemes are [3] and [4]. As described in Sections II-C and VI-F, these schemes need to deploy trusted third parties, are vulnerable to popularity tamper attacks, and cannot provide scalability and updatability, while our schemes address these limitations.

IX. CONCLUSION
In this article, we first propose a single-tenant popularity-based encrypted deduplication scheme with fully random tags. We use HE to generate random tags, avoiding storing deterministic tags to record data popularity. Besides, we reduce the time complexity of tag equality-testing by using binary search in the AVL tree. We also design a PoW protocol to resist the popularity tamper attack. For scalability and key rotation, we extend our single-tenant scheme to a multi-tenant scheme by introducing MKH-PRE. In the multi-tenant scheme, users in different tenants use different HE key pairs to generate data tags, while the server can still record the cross-tenant data popularity. The multi-tenant scheme also supports key rotation based on the proxy re-encryption of MKH-PRE. We implement prototypes of our schemes and evaluate their performance. The results show that our schemes have high storage efficiency and achieve efficient data encryption and key update.
Chunfu Jia is a Ph.D. supervisor, a professor, and the head of the Department of Cyber Sciences, Nankai University. His main research interests include network and system security, cryptography application, and malware analysis.
Yixuan Huang is currently working toward the master's degree with the College of Cyber Science, Nankai University, Tianjin, China. Her research interest mainly includes homomorphic encryption.
Hang Chen is currently working toward the master's degree with the College of Cyber Science, Nankai University, Tianjin, China. Her main research interests include cryptography and data deduplication.

Fig. 2. The records of the storage server and trusted third party.

• CE.KeyGen(M): On input a message M, output a convergent key K on M.
• CE.TagGen(M): On input a message M, output a data tag T on M.
• CE.Enc(K, M): It is a symmetric encryption algorithm. On input a convergent key K and a message M, output a convergent ciphertext C.
• CE.Dec(K, C): It is a symmetric decryption algorithm. On input the convergent key K and the convergent ciphertext C, output the message M.
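The four CE algorithms can be sketched as follows. This is a toy illustration under stated assumptions, not the construction of [1]: the key and tag are derived with SHA-256 under hypothetical domain-separation prefixes, and the symmetric cipher is replaced by a SHA-256 counter-mode keystream because the Python standard library has no block cipher. It should not be used to protect real data.

```python
import hashlib


def ce_keygen(message: bytes) -> bytes:
    """Derive the convergent key K from the message content."""
    return hashlib.sha256(b"key|" + message).digest()


def ce_taggen(message: bytes) -> bytes:
    """Derive a deterministic data tag T from the message content."""
    return hashlib.sha256(b"tag|" + message).digest()


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key || counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def ce_enc(key: bytes, message: bytes) -> bytes:
    """Deterministic encryption: the same (K, M) always gives the same C."""
    stream = _keystream(key, len(message))
    return bytes(m ^ s for m, s in zip(message, stream))


def ce_dec(key: bytes, ciphertext: bytes) -> bytes:
    """XOR with the same keystream inverts ce_enc."""
    return ce_enc(key, ciphertext)
```

Because K and T depend only on M, two users holding the same file produce identical ciphertexts and tags, which enables deduplication but also permits offline brute-force guessing of predictable messages.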

• HE.KeyGen(λ): On input a security parameter λ, output the public key pk and secret key sk.
• HE.Enc(pk, m): On input a message m and the public key pk, output the ciphertext ct of m.
• HE.Dec(sk, ct): On input the ciphertext ct of m and the secret key sk, output the message m.
• HE.Add($ct_1$, $ct_2$): On input ciphertexts $ct_1$ and $ct_2$ of $m_1$ and $m_2$, output the ciphertext $ct_{add}$ of $(m_1 + m_2)$.
• HE.Sub($ct_1$, $ct_2$): On input ciphertexts $ct_1$ and $ct_2$ of $m_1$ and $m_2$, output the ciphertext $ct_{sub}$ of $(m_1 - m_2)$.
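To make the HE interface concrete, the sketch below instantiates it with a toy Paillier cryptosystem. This is a swapped-in, additively homomorphic scheme chosen because it fits in a few lines of standard-library Python; it is not the lattice-based HE used in this article, and the tiny primes offer no security. The `indicator` helper, which maps a decrypted difference to -1, 0, or 1, is an assumption about how a comparison result could be derived.

```python
import math
import secrets


def he_keygen(p=1009, q=1013):
    """Toy Paillier key generation with the generator fixed to g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (n, lam, mu)  # (pk, sk)


def he_enc(pk, m, r=None):
    """Randomized encryption: fresh randomness r gives a fresh ciphertext."""
    n = pk
    n2 = n * n
    while r is None or math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def he_dec(sk, ct):
    n, lam, mu = sk
    x = pow(ct, lam, n * n)
    return (x - 1) // n * mu % n


def he_add(pk, ct1, ct2):
    """Ciphertext of (m1 + m2) mod n."""
    return ct1 * ct2 % (pk * pk)


def he_sub(pk, ct1, ct2):
    """Ciphertext of (m1 - m2) mod n."""
    n2 = pk * pk
    return ct1 * pow(ct2, -1, n2) % n2


def indicator(diff, n):
    """Map a decrypted difference in Z_n to a sign (an assumed convention)."""
    if diff == 0:
        return 0
    return 1 if diff < n // 2 else -1
```

Two encryptions of the same message look unrelated, yet decrypting their homomorphic difference yields 0, which is the property that lets random tags be tested for equality without storing deterministic tags.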

• MP.KeyGen(λ): On input a security parameter λ, output the public key pk and secret key sk.
• MP.Enc(pk, m): On input a message m and the public key pk, output the ciphertext ct of m.
• PoW(C(pk, $F_c$), SS($c_a$, $c_b$, $P_{set}$), CSP(sk)) → (ε, $P_{set}$, ε): During the PoW protocol, client C inputs pk and a convergent ciphertext $F_c$, SS inputs two random seeds ($c_a$, $c_b$) and a proof set $P_{set}$ (described in Section IV-F), while CSP inputs sk. When the protocol concludes, C and CSP output nothing, while SS outputs an updated $P_{set}$.
• Enc($k_E$, m) → $C_m$: The encryption algorithm takes a key $k_E$ and a message m as input, and outputs a ciphertext $C_m$.
• Dec($k_E$, $C_m$) → m: The decryption algorithm takes a key $k_E$ and a ciphertext $C_m$ as input, and outputs the message m.
• CV(SS(pk, rT, $F_c$), CSP(sk)) → (f, ε): During the ciphertext verification protocol, SS inputs pk, a random tag rT, and a convergent ciphertext $F_c$, while CSP inputs sk. When the protocol concludes, SS outputs a flag f ∈ {0, 1} to indicate whether the verification is successful, while CSP outputs nothing.
• IV($k_c$, m) → f: The integrity verification algorithm takes a convergent key $k_c$ and a message m as input, and outputs a flag f to indicate whether the verification is successful.

F
and runs PoW with $C'$. Specifically, SS generates a random coin $b \in \{0, 1\}$ and a new random seed $c_2$. Then it sends the challenge seeds $(c_b, c_2)$ to $C'$. $C'$ generates proofs $(p'_b, p_2)$ based on $(c_b, c_2)$ and then sends them to SS. The generation of $(p'_b, p_2)$ is the same as that of $(p_0, p_1)$. SS computes a proof equality-testing ciphertext $etc_p$ = HE.Sub($p'_b$, $p_b$) and sends it to CSP, where $p_b$ is the proof stored in $P_{set}$ before. CSP returns the comparison result $rs_p$ = I(HE.Dec(sk, $etc_p$)) to SS. If $rs_p$ is equal to 0, then $C$ and $C'$ uploaded the same file and both pass PoW; SS then increases the owner number of $F$ by one, inserts $(c_2, p_2)$ into $P_{set}$, and adds $C'$ to the list of data owners. Otherwise, either $C$ or $C'$ uploaded a fake proof, and SS treats $F$ and $F'$ as two different files and sets their popularity separately. If $C_m$ is the $m$-th uploader of $F$, SS randomly selects $c_i \in \{c_0, c_1, \ldots, c_{m-1}\}$ from $P_{set}$ and a new random seed $c_m$ to run PoW with $C_m$. The process is the same as that for the second uploader.

Fig. 6. The storage structure of SS (the threshold t is set to 100).
Data Download: To download file F, client C sends random tag rT to SS. If F is unpopular, SS sends random ciphertext $F_r$ to C, and C uses random key $k_r$ to restore file F = SE.Dec($k_r$, $F_r$), where SE.Dec(·) is a symmetric decryption algorithm. Otherwise, if F is popular, SS sends convergent ciphertext $F_c$ to C, and C uses convergent key $k_c$ to restore file F = CE.Dec($k_c$, $F_c$). After that, C runs IV($k_c$, F) to perform the integrity verification. If the output is f = 1, the integrity verification passes. Otherwise, the outsourced data may have been tampered with by adversaries, and the decryption fails.

Fig. 8. The equality-testing for tags in the multi-tenant scheme.
Challenge: A outputs two messages $m_0$ and $m_1$ of equal size. C picks a bit $b \in \{0, 1\}$ and generates a random symmetric key $k_c$. Then, C computes a random tag $rT_b$ = RTG(pk, $m_b$) and a random ciphertext $C_b$ = SE.Enc($k_c$, $m_b$). Besides, C generates two random seeds $(c_0, c_1)$ and a convergent ciphertext $C_b$ of $m_b$, and then uses $(c_0, c_1, C_b, pk)$ to generate a proof set $P_b$ for $m_b$ by locally emulating PoW. Finally, C returns $(rT_b, C_b, P_b)$ to A.
Guess: A outputs a guess $b' \in \{0, 1\}$ of which message was chosen by C.

$G_0$: is identical to G.
$G_1$: The challenger C replaces the random ciphertext $C_b$ by a random bit string $C_r$, whose length is equal to that of $C_b$. The indistinguishability of $G_1$ from $G_0$ follows from the semantic security of SE.
$G_2$: The challenger C replaces the random tag $rT_b$ by a random bit string $rT_r$ of equal length. Then C replaces all elements in the proof set $P_b$ by random bit strings of equal length to output a set $P_r$, whose elements are all uniformly random. The indistinguishability of $G_2$ from $G_1$ follows from the semantic security of HE.

Fig. 10. The performance comparison between our schemes and Sdedup when uploading unpopular data.

Fig. 11. The time overheads of each component in each scenario.

Fig. 13. Linear and binary search for random tags.

Fig. 15. The overhead of each part in the tag equality-testing (single-tenant scheme).

Fig. 16. The overhead of each part in the tag equality-testing (multi-tenant scheme).

Fig. 17. The performance of key update.

Ruiqi Li received the PhD degree in computer science and technology from Nankai University, in 2021. He is currently an assistant professor with the College of Safety Science and Engineering, Civil Aviation University of China. His current research interests include fully homomorphic encryption, lattice-based cryptography, and cloud computing security.
Qiaowen Jia is currently working toward the PhD degree with the Institute of Software, University of Chinese Academy of Sciences. Her research interests include concurrent programs and software verification.

TABLE I: SECURITY COMPARISON FOR POPULARITY-BASED ENCRYPTED DEDUPLICATION SCHEMES