VeriDedup: A Verifiable Cloud Data Deduplication Scheme With Integrity and Duplication Proof

Data deduplication is a technique that eliminates duplicate data in order to save storage space and upload bandwidth, and it has been widely applied by cloud storage systems. However, a cloud storage provider (CSP) may tamper with user data or cheat users into paying for unused storage for duplicate data that are stored only once. Although previous solutions adopt message-locked encryption along with Proof of Retrievability (PoR) to check the integrity of deduplicated encrypted data, they ignore proving the correctness of the duplication check during data upload and require the same file to be derived into the same verification tags, which is vulnerable to brute-force attacks and restricts users from flexibly creating their own individual verification tags. In this paper, we propose a verifiable deduplication scheme called VeriDedup to address the above problems. It guarantees the correctness of the duplication check and supports flexible tag generation for integrity check over encrypted data deduplication in an integrative way. Concretely, we propose a novel Tag-flexible Deduplication-supported Integrity Check Protocol (TDICP) based on Private Information Retrieval (PIR) by introducing a novel verification tag called note set, which allows multiple users holding the same file to generate their individual verification tags and still supports tag deduplication at the CSP.
Furthermore, we make the first attempt to guarantee the correctness of the data duplication check by introducing a novel User Determined Duplication Check Protocol (UDDCP) based on Private Set Intersection (PSI), which prevents a CSP from providing a fake duplication check result to users. Security analysis shows the correctness and soundness of our scheme. Simulation studies based on real data show the efficacy and efficiency of our proposed scheme and its significant advantages over prior art.


INTRODUCTION
Cloud computing has become a popular information technology service by providing a huge amount of resources (e.g., storage and computing) to end users based on their demands. Among all cloud computing services, cloud storage is the most popular. Since the volume of data in the world is increasing rapidly, saving cloud storage becomes essential. One of the key causes of storage waste is duplicate data storage: multiple users may save the same files, or different files containing the same data blocks, at the cloud. Data deduplication [1], [2], [3] provides a promising solution to this issue. In a deduplication scheme, the CSP cooperates with the cloud user to first check whether a file pending upload has already been saved, and then gives a user whose data are found to be duplicates a way to access the file without storing another copy at the cloud.
However, since the CSP cannot be fully trusted, cloud users may suffer from security and privacy issues. Notably, a semi-trusted CSP may modify, tamper with, or delete the uploaded data for profit. Damage to deduplicated data could cause huge losses to all related users (e.g., data owners and holders). Thus, the integrity of the data stored at the cloud should be verified, especially for duplicate data stored with deduplication.
Several Proof of Retrievability (PoR) schemes [4], [5], [6], [7], [8], [9] have been proposed over the past decade to address integrity checking of cloud data storage. In such schemes, a user adds verification tags to a file. During verification, the user creates a random challenge and sends it to the CSP, and the CSP has to use all the data in the user's corresponding stored files as inputs to compute a response for the user. The user then checks the integrity of the stored file by verifying the response. However, existing PoR solutions mainly aim to improve performance at the user side and assume that the CSP has infinite computation and storage resources, whereas in practice the CSP performs data deduplication to make the most economical use of its storage. Unfortunately, the existing solutions mentioned above are incompatible with deduplication. This is because their verification tags are created with individual user private keys that are unknown to other users, so different verification tags are generated for the same file held by different users, and these verification tags cannot be deduplicated at the CSP, as shown in Fig. 1a.
Message-locked PoR [10], [11] provides a promising solution for checking data integrity under deduplication. It derives the same verification tag from the same file based on message-locked encryption, as shown in Fig. 1b. However, such a design restricts users from creating their own individual tags with their private keys. Practically, we expect an effective method that can check data integrity with the support of deduplication, where each user can generate its own individual verification tags from its private key to resist brute-force attacks.
Another security issue ignored by the previous literature is guaranteeing the correctness of the data duplication check provided by the CSP. Several schemes [12], [13] motivate the CSP to perform deduplication, but ignore that the CSP could cheat users by providing a fake duplication check result. The reason is simple: the CSP can gain extra profit by asking users to pay the normal storage fee, without granting the deserved discount, while still performing deduplication to save storage space. Table 1 illustrates four situations in which the CSP handles a duplication check on file storage. A problem may arise in the third situation, where the CSP actually holds the file being tested but tells the user that it does not, so that the user pays the normal storage fee without the discount that should be offered due to deduplication and storage saving. With the storage space saved in this way, the CSP can earn more by serving more users in the same dishonest manner. An effective mechanism is needed to prevent the user from being cheated by the CSP during the data duplication check.
In this paper, we propose a novel deduplication scheme called VeriDedup to tackle the above two security issues in an integrative way. It contains a novel Tag-flexible Deduplication-supported Integrity Check Protocol (TDICP) and a novel User Determined Duplication Check Protocol (UDDCP). TDICP explores a new verification tag called a note set, in which each note is a randomized bit sequence that conforms to a function $f$. The note set is inserted into the files based on Private Information Retrieval (PIR). TDICP allows users to create their own individual verification tags to check data integrity at the CSP while remaining compatible with deduplication. Meanwhile, UDDCP explores a new challenge and response mechanism based on Private Set Intersection (PSI) that lets the user, instead of the CSP, be the first to tell whether a file is duplicate, so that the CSP cannot cheat the user on the result of the duplication check during file upload. VeriDedup is built upon our previous scheme [14], which offers such functionalities as deduplication over ciphertext, Proof of Ownership (PoW), and key assignment by employing proxy re-encryption (PRE). In this paper, we focus on the integrity check and duplication check that are ignored in [14]. Thus, we assume that the functionalities of PoW and encrypted data deduplication are available; they are not the focus of this paper.
Specifically, the main contributions of this paper are summarized below: 1) We propose a novel protocol named TDICP, based on PIR, to check the integrity of uploaded files at the CSP with deduplication employed. TDICP allows users to generate their own individual verification tags for integrity check, while these verification tags, although different, can still be deduplicated at the CSP. 2) We propose another novel protocol named UDDCP, based on PSI, to guarantee the correctness of the duplication check, so that the CSP cannot cheat the user into paying for storage space left unused due to deduplication. 3) We construct a novel deduplication scheme called VeriDedup that contains the above two novel protocols and other essential properties, such as PoW and data access key assignment, by reshaping our previous scheme in [14] in order to overcome its shortcomings regarding integrity and duplication proof. 4) We prove the security of TDICP and UDDCP by constructing several games and conduct both theoretical analysis and experimental simulation to evaluate their performance. Our results show their efficacy and efficiency.
The rest of the paper is structured as follows. Section 2 briefly reviews related work. Section 3 introduces the main techniques used in our scheme. Section 4 presents the system model, threat model, and design goals of VeriDedup. We present the detailed construction of VeriDedup, which contains TDICP and UDDCP, in Section 5. We prove the security of the above two novel protocols in Section 6 and report the simulation-based evaluation results in Section 7. Finally, we conclude this paper in Section 8.

RELATED WORK
Our work is most related to the proof of retrievability (PoR) solutions in cloud data deduplication [15], [16], [17]. Juels and Kaliski [15] proposed a sentinel-based PoR scheme, in which a data owner adopts an Error Correcting Code (ECC) and inserts special blocks called sentinels. The sentinels are indistinguishable from encrypted blocks in an encrypted file. During an integrity challenge, a verifier asks a prover for those sentinels by disclosing their positions to the prover. Therefore, this solution supports only a limited number of PoR queries, and after several queries a data owner has to download the whole file and insert new sentinels into it. Ateniese et al. [6] proposed a scheme by defining the concept of Provable Data Possession (PDP) based on homomorphic tags, which is weaker than PoR in that it can verify that the CSP possesses parts of the file (called blocks) but cannot guarantee that the file is fully stored. Their scheme allows public verifiability, which means that any third party can verify the integrity of the files without disclosing any private information of the data owner. However, the use of homomorphic tags incurs a high computation cost, which places a heavy computation burden on the data owner. Their later work [7] incorporates an erasure code to help recover from small corruptions. However, that solution suffers from an attack in which the CSP selectively deletes some of the redundant blocks but still succeeds in providing a valid proof to the data owner.
Much effort was then made to improve the performance of PoR schemes. Shacham and Waters [18] proposed a new solution based on their concept of Compact PoR, which adopts an erasure code and an authenticator built with a BLS signature [19] and Message Authentication Codes (MAC) [11]. However, the computational complexity of generating the authenticator is high and the number of authenticators is linear in the number of blocks. Xu and Chang [16] proposed to enhance the scheme in [18] with a polynomial commitment [18] to reduce communication cost. Azraoui et al. [20] proposed a scheme called StealthGuard, which uses PIR with a Word Search (WS) technique to retrieve a witness of watchdogs (similar to tags) and allows an unlimited number of queries. Compared with other works, the generation of watchdogs is more lightweight than the generation of tags in, e.g., [7], [18]. In addition, the overhead of storing the watchdogs is less than that of previous work. However, those works fail to support deduplication over verification tags.
The concept of message-locked proofs of retrievability was then proposed to resolve the above conflict. Bellare et al. [21] formalized a new cryptographic primitive called Message-locked Encryption (MLE), which subsumes convergent encryption [22], [23] and derives the same verification tag from the same data block to allow deduplication of all verification tags. Chen et al. [24] proposed a secure data deduplication mechanism based on an improved MLE scheme to enable dual-level source-based deduplication of large files. Moreover, Zheng et al. [5] introduced a new proof of storage scheme with deduplication based on a publicly verifiable proof of data possession. In their scheme, users can verify the correct storage of deduplicated data with the key of the first user who actually uploads the file. However, this scheme has been proved insecure under a weak key attack in [25], and it cannot prevent users from being cheated by the CSP. Vasilopoulos et al. [10] proposed a scheme that transforms an existing PoR into a message-locked form and integrates it with a deduplication function. However, these works require the same file to be derived into the same verification tag. In contrast, multiple users holding the same file stored at the cloud may wish to create different tags for data integrity check, which improves integrity check security by resisting brute-force attacks but hinders deduplication. Table 2 compares our scheme with existing works in terms of unlimited queries, tag deduplication support, tag generation flexibility, and duplication check correctness guarantee. From Table 2, we observe that existing works either cannot perform deduplication on verification tags or do not allow users to flexibly create their own individual verification tags during deduplication. In particular, none of the existing schemes considers the need for a correctness guarantee on the duplication check, which leaves the CSP free to cheat users for profit.

PRELIMINARY
In this section, we introduce the main techniques used in VeriDedup, including PRE, RSA-Private Set Intersection (RSA-PSI), and PIR. PRE is applied to assign file keys to an authorized data holder, RSA-PSI is applied to enable the data holder, instead of the CSP, to first decide whether a file is duplicate, and PIR is applied to enable the data holder to retrieve the note set without exposing the positions of the set to the CSP.

Proxy Re-Encryption (PRE)
A PRE scheme consists of five polynomial-time algorithms: key generation (KG), encryption (E), re-encryption key generation (RG), re-encryption (R), and decryption (D). $(KG, E, D)$ are the standard key generation, encryption, and decryption algorithms. Suppose we have two parties A and B. On input the security parameter $1^k$, KG outputs two public and private key pairs $(pk_A, sk_A)$ and $(pk_B, sk_B)$. On input $pk_A$ and data $M$, E outputs a ciphertext $C_A = E(pk_A, M)$.
On input $(pk_A, sk_A, pk_B)$, the re-encryption key generation algorithm RG outputs a re-encryption key $rk_{A \to B}$ for a proxy.
On input $rk_{A \to B}$ and ciphertext $C_A$, the re-encryption function R outputs $R(rk_{A \to B}, C_A) = E(pk_B, M) = C_B$. On input $C_B$ and $sk_B$, the decryption algorithm D outputs the plaintext $M = D(sk_B, C_B)$.
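The five-algorithm interface above can be illustrated with a toy sketch. It is not the pairing-based construction that VeriDedup inherits from [14]; instead it swaps in a simpler ElGamal-style, BBS98-like bidirectional PRE over a tiny prime-order group. The group parameters, the function names, and the fact that `rekey` takes both secret keys (rather than $(pk_A, sk_A, pk_B)$ as RG does in the text) are all assumptions made for illustration.

```python
import secrets

# Toy parameters (assumption): safe prime p = 2q + 1 with q = 1019, and g = 4
# generating the order-q subgroup of squares. Far too small for real use.
P_MOD, Q_ORD, G = 2039, 1019, 4


def keygen():
    """KG: return (pk, sk) with pk = g^sk mod p."""
    sk = secrets.randbelow(Q_ORD - 1) + 1  # sk in [1, q-1]
    return pow(G, sk, P_MOD), sk


def encrypt(pk, msg):
    """E: ElGamal-style ciphertext (msg * g^r, pk^r) for msg in [1, p-1]."""
    r = secrets.randbelow(Q_ORD - 1) + 1
    return msg * pow(G, r, P_MOD) % P_MOD, pow(pk, r, P_MOD)


def rekey(sk_a, sk_b):
    """RG (simplified): rk = sk_b / sk_a mod q; this toy needs both secret keys."""
    return sk_b * pow(sk_a, -1, Q_ORD) % Q_ORD


def reencrypt(rk, ct):
    """R: turn (msg * g^r, g^(a r)) into (msg * g^r, g^(b r)) without decrypting."""
    c1, c2 = ct
    return c1, pow(c2, rk, P_MOD)


def decrypt(sk, ct):
    """D: recover g^r = c2^(1/sk), then divide it out of c1."""
    c1, c2 = ct
    g_r = pow(c2, pow(sk, -1, Q_ORD), P_MOD)
    return c1 * pow(g_r, -1, P_MOD) % P_MOD
```

In this sketch the proxy only ever exponentiates the second ciphertext component, so it never sees the plaintext, which mirrors the role the CSP plays when it transfers a wrapped key from one party to another.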

RSA-PSI
PSI [26], [27], [28] enables two parties to compute the intersection of their inputs in a privacy-preserving way, such that only their common inputs are revealed. A PSI scheme based on RSA blind signatures (RSA-PSI) consists of four main phases: base phase, setup phase, online phase, and update phase.
Base phase: Suppose we have a client C and a server S. S and C agree on the RSA public key $(N, e)$ and the false positive rate for the cuckoo filter CF [29]. S generates the RSA private key $d$. C chooses $N_c^{\max}$ random numbers and calculates their inverses as well as their modular exponentiations to the power $e$.
Setup phase: On input the set owned by S, S encrypts each element using its private key $d$, inserts the ciphertexts into the cuckoo filter, and sends the CF to C.
Online phase: C first blinds its inputs with the encryptions of the respective random values and sends the resulting values to S. S encrypts the blinded values using its private key $d$ and returns the results to C. Using the inverses of the respective random values, C can then unblind the encrypted blinded values through multiplications, by applying the RSA property that $x^{ed} \equiv x \pmod{N}$. C finally obtains the intersection by checking whether the unblinded encrypted elements are in the CF that was sent by S in the setup phase.
Update phase: On input a new element $u_i$ added to its set, S encrypts it using its private key $d$, chooses an efficient way to insert it into the CF, and sends the updated CF to C.
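The phases above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the RSA modulus is built from two small Mersenne primes and is not secure, the hash-to-integer function `h` is a hypothetical choice, and a plain Python `set` stands in for the cuckoo filter (so there are no false positives and no compression).

```python
import hashlib
import math
import secrets

# Toy RSA key (assumption): two Mersenne primes; not secure, only illustrative.
P, Q, E = 2**31 - 1, 2**61 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))  # server's private exponent d


def h(item: bytes) -> int:
    """Hypothetical hash onto integers modulo N."""
    return int.from_bytes(hashlib.sha256(item).digest(), "big") % N


def server_setup(server_items):
    """Setup phase: sign every server element; a set stands in for the cuckoo filter."""
    return {pow(h(x), D, N) for x in server_items}


def client_blind(client_items):
    """Online phase, client side: blind H(y) with r^e and remember r^-1."""
    blinded, inverses = [], []
    for y in client_items:
        while True:
            r = secrets.randbelow(N - 2) + 2
            if math.gcd(r, N) == 1:
                break
        blinded.append(h(y) * pow(r, E, N) % N)
        inverses.append(pow(r, -1, N))
    return blinded, inverses


def server_sign(blinded):
    """Online phase, server side: raise each blinded value to d."""
    return [pow(a, D, N) for a in blinded]


def client_intersect(client_items, signed, inverses, filt):
    """Unblind (H(y) r^e)^d * r^-1 = H(y)^d and test membership in the filter."""
    return [y for y, c, r_inv in zip(client_items, signed, inverses)
            if c * r_inv % N in filt]
```

The server only ever sees blinded values, and the client only learns which of its own items produce a signed value that the filter already contains.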

PIR
PIR [30], [31] enables a database user, or client, to obtain information from a database in a way that prevents the database from knowing which data was retrieved. Assume a dataset $D$ is an $X \times Y$ matrix held by a server S and we have a client C; let $l$ denote the index of the column the client is interested in. To execute a PIR request, a PIR scheme normally performs the following steps.
Setup phase: C generates a large number $m$ as the order of a group $G$, selects a random $b \in \mathbb{Z}_m^*$, where $\gcd(b, m) = 1$, and keeps $b$ and $m$ secret.
Query phase: C generates a set $(e_0, \ldots, e_i)$ for the columns $(x_0, \ldots, x_i)$ using a randomly selected set $(a_0, \ldots, a_i)$: if $x_{i_l}$ is one of the queried columns, then $e_{i_l} = a_{i_l} N^r$; otherwise, if $x_i$ is not one of the queried columns, then $e_i = N^l + a_i N^r$. Meanwhile, all $e_i < m/(t(N - 1))$. Then, C computes $v = \{v_i \mid v_i = b e_i \bmod m\}$ and sends $Req = \{v, tag\}$ to S.
Response phase: On receiving $Req$, S computes $Resp = v \times D$ and sends it back to C.
Extraction phase: C computes $Res = Resp \cdot b^{-1} \bmod m$ and obtains the data of the queried column.
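The blind, respond, and unblind steps can be sketched as below. The sketch assumes a simplified coefficient encoding in which the queried column carries a unit term ($e = 1 + aN$) and every other column is a pure multiple of $N$ ($e = aN$), so that the queried data falls out by reduction modulo $N$; this convention is chosen for the illustration and may differ from the exact $N^l$, $N^r$ encoding described in the text. The function names, the representation of the database as a list of columns, and the bound `n` on entries are likewise assumptions.

```python
import math
import secrets


def make_query(t, n, queried, a_bits=32):
    """Build the blinded coefficients v for t columns whose entries lie in [0, n-1]."""
    a_max = 1 << a_bits
    # Unit term only on the queried column (assumed convention, see lead-in).
    e = [(secrets.randbelow(a_max - 1) + 1) * n + (1 if i == queried else 0)
         for i in range(t)]
    # Pick m so that every e_i < m / (t (n - 1)); the weighted sum then stays below m.
    m = t * (n - 1) * (a_max * n + 1) + 1 + secrets.randbelow(1 << 64)
    while True:
        b = secrets.randbelow(m - 2) + 2
        if math.gcd(b, m) == 1:
            break
    v = [b * e_i % m for e_i in e]
    return v, (m, b)


def respond(v, columns):
    """Server side: weighted sum of all columns, computed without knowing m or b."""
    width = len(columns[0])
    return [sum(v_i * col[j] for v_i, col in zip(v, columns)) for j in range(width)]


def extract(resp, secret, n):
    """Client side: remove the blinding factor b, then reduce modulo n."""
    m, b = secret
    b_inv = pow(b, -1, m)
    return [(x * b_inv % m) % n for x in resp]
```

Because the server must touch every column to form the weighted sum, the access pattern itself reveals nothing about which column was requested.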

PROBLEM STATEMENT
In this section, we describe the system model, the threat model and the design goals of VeriDedup.

System Model
VeriDedup offers a guarantee on the correctness of the duplication check and supports the integrity check of deduplicated encrypted data in cloud storage.
Our target system contains three types of entities: 1) The data holder, who owns data and saves its data, consisting of multiple blocks, at the CSP. It is possible that a number of eligible data holders share the same encrypted data blocks at the CSP. In particular, the data holder that first uploads the data blocks to the CSP is denoted as the data owner with regard to those blocks. 2) The CSP, which provides a data storage service with deduplication to data holders. Only one data copy is stored at the CSP, which can be accessed by all authorized data holders. 3) The authenticated auditor (AA), which serves as a third party to check data ownership, authorize data access, and cooperate with the other two types of entities to audit the whole procedure of the data duplication check. The system model of VeriDedup is shown in Fig. 2.

Threat Model
We perform our research based on the following assumptions. We assume that the data holder is honest and that the CSP is semi-trusted. The CSP may raise the following three security threats: 1) snooping on the private data of the data holders; 2) cheating the data holders by providing a wrong duplication check result in order to charge a higher storage fee; 3) causing data loss through careless data maintenance. In VeriDedup, we focus on the last two issues, since many existing solutions to the first issue can be found in the literature [5], [32]. Thus, we assume that the first issue has been solved, e.g., through data encryption. In addition, we assume that AA and CSP do not collude. However, AA is semi-trusted and curious about the data stored at the cloud, so private data should be kept away from AA. We assume that data holders, CSP, and AA communicate with each other through secure channels by applying some security protocol (e.g., Open-Secure Sockets Layer (SSL)). All system parameters are shared with all related parties in a secure way during the system setup or initialization phase.

Design Goals
VeriDedup is a verifiable cloud data deduplication storage scheme with integrity and duplication proof. It has the following design goals: 1) Independent integrity check under deduplication: VeriDedup allows the data holder to check the integrity of its files stored at the CSP without downloading the whole files and without interacting with the corresponding data owner. 2) Flexible tag generation: VeriDedup allows each data holder to create its own individual verification tags while data deduplication can still be performed over those tags. 3) Correctness guarantee of duplication check: VeriDedup assures the correctness of the duplication check. Thus, a semi-trusted CSP can never cheat the data holders into uploading any data that have already been stored by the CSP.

THE PROPOSED SCHEME
In this section, we introduce VeriDedup that can realize both integrity check and duplication proof over encrypted cloud data deduplication.

Overview
VeriDedup follows the construction of our previous deduplication scheme [14] and improves it by using PSI and PIR to ensure both data integrity and duplication check correctness over encrypted data deduplication. Specifically, compared with previous work, we introduce a PSI-based challenge and response mechanism into the duplication check procedure so that the data holder, instead of the CSP, first tells whether the uploaded blocks are duplicate. In addition, we employ AA to verify the computations of the CSP during the duplication check, so that the CSP cannot cheat users into uploading data blocks that have already been stored. Furthermore, we propose a note insertion mechanism based on PIR that lets the data holder insert a specific set (called a note set), containing several randomized bit sequences that conform to a hidden function, as verification tags into the encrypted blocks of an uploaded file. The data owners/holders who are proved to have ownership of the corresponding blocks can verify the integrity of the uploaded blocks through a challenge on whether the notes conform to the hidden function. Note that the verification tags generated by multiple data holders with various notes can also be deduplicated in VeriDedup, so the CSP is no longer required to maintain multiple verification tags from different data holders for the same block, which reduces the storage consumed when performing deduplication. In what follows, we first introduce the two proposed novel protocols (i.e., TDICP and UDDCP) and then detail the whole construction of VeriDedup.

TDICP Design Brief
The protocol TDICP contains the following main procedures: System setup, Note generation and insertion, Check Initialization, Response computation, and Integrity check.
System setup: On input the security parameter, AA outputs a hidden function $f$, which is then applied for note generation.
Note generation and insertion: On input the hidden function $f$ and the secret keys of a data holder, the data holder outputs a randomized note set $S$ and a position set $P$ according to the uploaded blocks of its file and inserts the note set into the corresponding positions of the encrypted blocks.
Check initialization: On input the check indexes of the blocks, the data holder outputs a coefficient set $e$ and computes the challenge set $v = b \cdot e \bmod m$, where $\gcd(m, b) = 1$.
Response computation: On input the challenge set $v$, the CSP outputs the response $Resp$.
Integrity check: On input the response $Resp$, the data holder outputs the check result by computing $Res = Resp \cdot b^{-1} \bmod m$ to pick out the note set and validating whether these notes conform to the hidden function $f$. If the verification passes, the data holder confirms the integrity of the stored file.

UDDCP Design Brief
The protocol UDDCP contains the following main procedures: System setup, Filter generation, Check initialization, Response computation, Duplication check, and Filter update.
System setup: On input the secret parameter, the CSP outputs an RSA key pair $(e, d)$ under a large number $N$, and AA initializes an empty cuckoo filter.
Filter generation: On input the tag set $\{x\}$ maintained by the CSP, AA outputs the cuckoo filter as follows: 1) the CSP computes $a = H(x)^d$ for each tag of its tag set; 2) AA verifies the number of involved tags, the signatures of the tags, and the computation of the CSP; 3) AA inserts the set $\{a\}$ into the cuckoo filter.
Check initialization: On input the secret parameter, the data holder outputs three coefficient sets $\{r\}$, $\{r^{inv}\}$, and $\{r'\}$ for its maintained tag set $\{y\}$ and computes the challenge $A = H(y) \cdot r'$ for all $y$.
Response computation: On input the challenge set $\{A\}$, the CSP computes $C = A^d \bmod N$ and responds with $\{C\}$ to the data holder.
Duplication check: On input the response set $\{C\}$, the data holder outputs the duplicate tags as follows: 1) validate the computation of the CSP on $\{C\}$; 2) compute all $C \times r^{inv} \bmod N$ and check them against the cuckoo filter to find the intersection, which gives the duplicate tags. The data holder confirms the duplicate files corresponding to the duplicate tags.
Filter update: On input the update tag set $\{y'\}$, AA updates the cuckoo filter as follows: 1) the CSP computes $a' = H(y')^d$ for each $y'$ in the update tag set; 2) AA verifies the number of involved tags, the signatures of the tags, and the computation of the CSP; 3) AA inserts the set $\{a'\}$ into the cuckoo filter.

VeriDedup Construction
VeriDedup contains the following main procedures: System setup, Data preprocessing and Duplication Check, Note set insertion and Data Upload, Data integrity check, and Data Download. The details of the scheme are elaborated as follows:

System Setup
Assuming that $e: G_1 \times G_1 \to G_T$ is a bilinear map, where $G_1$ and $G_T$ are two groups of prime order $q$, the system parameters are a random generator $g \in G_1$ and $Z = e(g, g) \in G_T$.
During system setup, each data holder $u_w$ generates $sk_w = a_w$ and $pk_w = g^{a_w}$ for PRE, where $a_w \in \mathbb{Z}_p^*$. The public key $pk_w$ is used to generate the re-encryption key at AA for $u_w$. Let $E_q(a, b)$ be an elliptic curve over $GF(q)$, $P^*$ be a base point shared among system entities, $s_w \in_R \{0, \ldots, 2^s - 1\}$ be the elliptic curve secret key of data holder $u_w$, $V_w = -s_w P^*$ be the corresponding public key, and $s$ be a security parameter. The keys $(pk_w, sk_w)$ and $(V_w, s_w)$ of $u_w$ are bound to a unique identifier of the data holder, which can be a pseudonym and is crucial for the verification of user identity.
AA generates a hidden function $f$ as a consensus that all data holders will later use to create their unique note sets $S_{w,i}$, and broadcasts $f$ among all data holders. Note that $f$ can be an arbitrary function chosen according to the security level required by the data holders. Furthermore, AA generates $pk_{AA}$ and $sk_{AA}$ for PRE and broadcasts $pk_{AA}$ to the data holders.
The CSP initializes an RSA algorithm with a public and secret key pair $(e, d)$ under the modulus $N$. The key pair is used to encode the uploaded tags of the data holders and the CSP for the duplication check.

Data Preprocessing and Duplication Check
Suppose that two data holders $u_1$ and $u_2$ want to upload their data files $F_1$ and $F_2$ to the CSP. Let $u_1$ be the first to upload its file. It performs the data preprocessing and duplication check as follows:
Step 1: On input $F_1$ and the symmetric key $DEK_1$, $u_1$ performs the following computations: 1) Divide $F_1$ into several splits, where each split contains $m$ blocks. To protect the file from small corruptions, adopt an Error Correcting Code (ECC) to extend the $m$ blocks to $m + d - 1$ blocks, which can correct up to $d/2$ errors with an efficient $[m + d - 1, m, d]$ ECC such as a Reed-Solomon code [33], and obtain a set of blocks $\{B_{1,i}\}$. 2) For each block $B_{1,i}$, generate a block tag $y_{1,i} = H(H(B_{1,i}) \times P^*)$. 3) Send the set of tags $\{y_{1,i}\}$ to the CSP.
Step 2: Suppose that the CSP maintains a tag set $\{x_j\}$ gathered from previous data owners. The CSP interacts with AA and $u_1$ to perform a duplication check according to the following procedure. 1) For all $x_j$, the CSP generates $a_j = H(x_j)^d \bmod N$ and sends $\{a_j, H(x_j), sign(H(x_j))\}$ and $D$, where $j = 1, \ldots, N_s$, to AA, where $sign(H(x_j))$ is the signature on $H(x_j)$ signed by the original data owner of the tag $x_j$ and $D$ is the total number of tags held by the CSP. 2) On receiving these values, AA first verifies whether $\{a_j, H(x_j), sign(H(x_j))\}$ contains $D$ elements, to guarantee that the CSP uses all its maintained tags in the computation. If so, it then verifies all the signatures on $H(x_j)$, which ensures that the CSP indeed uses the tags uploaded by the previous data owners. Third, AA validates the correctness of the CSP computations on all $x_j$ using batch verification: it randomly creates $N_v$ non-overlapping subsets of $\{a_j, H(x_j)\}$ and, in each subset, verifies whether $\prod H(x_j) = (\prod a_j)^e$ holds. If all verifications pass, AA assumes that the CSP computations are correct and creates a cuckoo filter CF with $\{a_j\}$ as input, i.e., $CF = CF.Insert(\{a_j\})$, as a response to $u_1$. Note that this procedure is executed only once, during system setup. If another data holder needs to upload new files to the CSP, the CSP cooperates with AA to update the cuckoo filter, and there is no need to recalculate the parameters of the previous data owners mentioned above. 3) For all $y_{1,i}$, $u_1$ first selects random numbers $r_{1,1}, \ldots, r_{1,N_c}$ and computes $r^{inv}_{1,i} = r_{1,i}^{-1} \bmod N$ and $r'_{1,i} = r_{1,i}^{e} \bmod N$. It then sends the challenge $A[1,i] = H(y_{1,i}) \cdot r'_{1,i} \bmod N$ to the CSP, which responds with $C[1,i] = A[1,i]^d \bmod N$. $u_1$ verifies whether $A[1,i] = (C[1,i])^e$ holds to prove the correctness of the CSP computations and finally checks duplication with the cuckoo filter CF using the algorithm $CF.check(C[1,i] \cdot r^{inv}_{1,i} \bmod N)$ to confirm the duplicate blocks. The whole protocol is shown in Fig. 3.
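The batch verification that AA performs on the CSP's computations can be sketched as follows. The toy RSA key, the hash-to-integer function `h`, and the way the random non-overlapping subsets are formed (shuffle, then stride) are assumptions for illustration; signature checking on the tags is omitted.

```python
import hashlib
import random

# Toy RSA key (assumption): two Mersenne primes; not secure, only illustrative.
P, Q, E = 2**31 - 1, 2**61 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))


def h(tag: bytes) -> int:
    """Hypothetical hash onto integers modulo N."""
    return int.from_bytes(hashlib.sha256(tag).digest(), "big") % N


def batch_verify(pairs, n_subsets, rng=random):
    """pairs: list of (a_j, H(x_j)). In each random subset, check
    prod H(x_j) == (prod a_j)^e mod N with a single exponentiation."""
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    for k in range(n_subsets):
        prod_a = prod_h = 1
        for a, hx in shuffled[k::n_subsets]:  # k-th non-overlapping subset
            prod_a = prod_a * a % N
            prod_h = prod_h * hx % N
        if pow(prod_a, E, N) != prod_h:
            return False
    return True
```

Checking products rather than individual pairs costs one modular exponentiation per subset instead of one per tag; using several random subsets is what makes it harder for errors in different pairs to cancel inside one product.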

Note Set Insertion and Data Upload
Let $u_2$ be the second to upload a file, which contains the same blocks as the file of $u_1$. $u_1$ and $u_2$ perform the note insertion and data upload as follows:
Step 1: Since $u_1$ is the first to upload a new file that has not been stored by the CSP before, i.e., the duplication check is negative, it serves as the data owner and is required to upload its corresponding blocks $\{B_{1,i}\}$. Assume the $i$th block $B_{1,i}$ is uploaded. $u_1$ first encrypts $B_{1,i}$ with $DEK_1$ to get $CT_{1,i}$, which is stored as an $X \times Y$ matrix, and encrypts $DEK_1$ with $pk_{AA}$ to get $CK_1$. Let $S_{1,i} = \{h_{1,i,0}, \ldots, h_{1,i,k} \mid f(h_{1,i,0}, \ldots, h_{1,i,k}) = 0\}$ be a note set that conforms to the hidden function $f$. According to the PIR algorithm, with $B_{1,i}$ as a seed, $u_1$ shuffles the column index $[1, \ldots, X]$ and selects the first $r$ columns as the ones in which to insert the notes. Thus, in each column, $c = \lceil k/r \rceil$ notes are to be inserted. In each selected column, $u_1$ further shuffles the row index $[1, \ldots, Y]$ and takes the first $c$ indexes as the final positions at which to insert the notes. Denoting the position indexes as $P_{1,i} = \{p_{1,i,1}, \ldots, p_{1,i,k}\}$, $u_1$ then inserts all the notes $\{h_{1,i,k}\}$ into $CT_{1,i}$ according to the position indexes $P_{1,i}$ to obtain $CTI_{1,i}$ and sends $CTI_{1,i}$ and $CK_1$ to the CSP along with $pk_i$. At the same time, $u_1$ also uploads the tags of the new blocks $\{y_{1,i}\}$ for further duplication checks. On receiving a new block tag $y_{1,i}$, the CSP first adds it to its maintained tag set, $\{x_j\} = \{x_j\} \cup \{y_{1,i}\}$, then computes $a_{1,i} = H(y_{1,i})^d \bmod N$ and sends $\{a_{1,i}, H(y_{1,i}), sign(H(y_{1,i}))\}$ to AA. AA first checks the signatures on $H(y_{1,i})$, then randomly creates $N''_{1,v}$ non-overlapping subsets of $\{\{a_{1,i}\}, \{H(y_{1,i})\}\}$ and, in each subset, verifies whether $\prod H(y_{1,i}) = (\prod a_{1,i})^e$. If the verification passes, AA assumes that the CSP computation is correct and updates the cuckoo filter CF using $CF = CF.Insert(\{a_{1,i}\})$, which will be used in the next duplication check round.
If the duplication check is positive and the prestored blocks are from the same data holder, the data holder will inform the CSP to do nothing but maintain its blocks. If the blocks are from a different data holder, it will inform the CSP to perform deduplication.
Step 2: Informed of the duplication by a different user $u_2$, the CSP first checks the ownership of the blocks by passing the ownership verification task to the AA, which challenges the data holder $u_2$ on whether it really possesses the data blocks $B_{2,i'} = B_{1,i}$. We introduce an ownership verification protocol based on the cryptoGPS identification scheme [34]. In the protocol, AA first randomly chooses $c \in_R \{0, \ldots, 2^{s} - 1\}$ and challenges $u_2$ with $c$. $u_2$ computes $h = H(B_{2,i'}) + (s_2 \times c)$ and sends it as a response, along with $V_2$, to AA. AA then computes $H(hP^{*} + cV_2)$ and compares it with the tag $y_{1,i}$. If the verification passes, i.e., $y_{1,i} = H(hP^{*} + cV_2)$, AA confirms that $u_2$ has the duplicated blocks $B_{2,i'} = B_{1,i}$, generates the re-encryption key $rk_{AA \to u_2} = RG(pk_{AA}, sk_{AA}, pk_2)$, and sends it to the CSP. The CSP then transforms $CK_1$ into $CK_2$ by computing $R(rk_{AA \to u_2}, E(pk_{AA}, DEK_1)) = E(pk_2, DEK_1)$ for $u_2$.
At this moment, both $u_1$ and $u_2$ can access the same data blocks $B_{1,i}$ ($B_{2,i'}$) stored at the CSP and use the corresponding $CTI_{1,i}$ ($CTI_{2,i'}$) to perform the integrity check below. Note that each $B_{1,i}$ is correlated with only a single $CTI_{1,i}$, i.e., $CTI_{1,i} = CTI_{2,i'}$.
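The position selection in Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_note_positions` is hypothetical, Python's `random.Random` seeded with a SHA-256 digest of the block stands in for the pseudo-random permutation used by the scheme (it is not cryptographically secure), and indexes are 0-based. It shows why two holders of the same block derive the same position set.

```python
import hashlib
import math
import random


def select_note_positions(block: bytes, X: int, Y: int, k: int, r: int):
    """Derive k (column, row) note positions deterministically from a block.

    Assumption: a seeded non-cryptographic PRNG replaces the PRP of the paper.
    """
    seed = int.from_bytes(hashlib.sha256(block).digest(), "big")
    rng = random.Random(seed)
    c = math.ceil(k / r)              # notes per selected column
    cols = list(range(X))
    rng.shuffle(cols)                 # shuffle column indexes
    positions = []
    for col in cols[:r]:              # first r columns receive notes
        rows = list(range(Y))
        rng.shuffle(rows)             # shuffle row indexes in this column
        positions.extend((col, row) for row in rows[:c])
    return positions[:k]


if __name__ == "__main__":
    p1 = select_note_positions(b"same block", X=135, Y=136, k=32, r=8)
    p2 = select_note_positions(b"same block", X=135, Y=136, k=32, r=8)
    print(p1 == p2, len(p1))
```

Because the seed depends only on the block content, a second holder of the same block recomputes the same positions without contacting the first uploader.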

Data Integrity Check
Assume that data owner $u_1$ wants to upload a block set $\{B_{1,i}\}$ and data owner $u_2$ wants to upload a block set $\{B_{2,i'}\}$. Regardless of deduplication, when user $u_1$ challenges the integrity of a single block $B_{1,i}$ stored at the CSP, it first initializes a large number $m$ and $b \in Z_m^{*}$, where $\gcd(b, m) = 1$, as a secret. According to the position indexes $P_{1,i}$, it then generates a set $(e_{1,i,0}, \ldots, e_{1,i,z})$ for the columns $(x_{1,i,0}, \ldots, x_{1,i,z})$ with randomly selected $(d_{1,i,0}, \ldots, d_{1,i,z})$, where if $x_{1,i,l} \in P_{1,i}$, then $e_{1,i,l} = N^{l} + d_{1,i,l} N^{r}$; otherwise, if $x_{1,i,l} \notin P_{1,i}$, then $e_{1,i,l} = d_{1,i,l} N^{r}$, which matches the form used in the correctness proof of TDICP. Meanwhile, it holds that all $e_{1,i,z} < m/(t(N - 1))$ for some choice of $l < r$ and $d_{1,i,l}$. Finally, it computes $v_{1,i} = \{v_{1,i,l} \mid v_{1,i,l} = b\,e_{1,i,l} \bmod m,\ l \in [1, \ldots, z]\}$ and sends $Req_{1,i} = \{v_{1,i}, tag_{1,i}\}$ to the CSP. On receiving $Req_{1,i}$, the CSP computes $Resp_{1,i} = v_{1,i} \times B_{1,i}$ as a response and sends it back to $u_1$. $u_1$ then computes $Res_{1,i} = Resp_{1,i} \times b^{-1} \bmod m$ to obtain the queried columns and picks out the notes according to the position indexes $P_{1,i}$ to check whether the notes conform to the hidden function. Similarly, when user $u_2$ challenges the CSP, it generates its own $(m', b')$ as a secret and its own $(d_{2,i',0}, \ldots, d_{2,i',z})$ to generate another $(e_{2,i',0}, \ldots, e_{2,i',z})$ and the corresponding $Req_{2,i'} = (v_{2,i'}, tag_{2,i'})$. In cooperation with the CSP, $u_2$ can also obtain the note set based on the position indexes $P_{2,i'}$ to check whether the notes conform to the hidden function.
Furthermore, suppose that $u_1$ and $u_2$ share the same duplicated blocks $\{B^{*}_{i}\}$, $u_1$ has its unique blocks $\{B^{1}_{i}\}$, and $u_2$ has its unique blocks $\{B^{2}_{i}\}$. For $B_{1,i} \in \{B^{1}_{i}\}$, $u_1$ verifies $f(h_{1,i,0}, \ldots, h_{1,i,k}) = 0$ to check the integrity of $B_{1,i}$; likewise, for $B_{2,i'} \in \{B^{2}_{i}\}$, $u_2$ verifies $f(h_{2,i',0}, \ldots, h_{2,i',k}) = 0$. For $B_{*,i} \in \{B^{*}_{i}\}$, although $u_2$ is unaware of the exact notes inserted by $u_1$, since they share the same hidden function $f$ and $P_{1,i} = P_{2,i'}$, both can verify that $f(h_{*,i,0}, \ldots, h_{*,i,k}) = 0$ to check the integrity of $B_{*,i}$. Therefore, we not only deduplicate the same block uploaded to the CSP, but also take a further step to deduplicate the verification tags of duplicated blocks generated by multiple data holders. Note that since we apply ECC to help recover the files, there is no need to perform the integrity check over all blocks. If $u_1$ and $u_2$ succeed in $g$ random verifications over their corresponding block sets, our protocol guarantees the integrity of $F_1$ and $F_2$. The whole procedure is shown in Fig. 4.

Data Download
When $u_1$ wants to download $F_1$, it sends a request and the file name to the CSP. Upon receiving the request, the CSP first checks whether $u_1$ is authorized to download the file. If the check passes, the CSP returns the corresponding block set $\{CTI_{1,i}\}$ to $u_1$. $u_1$ then extracts all the notes according to the position indexes $P_{1,i}$ on each block to get the ciphertexts $\{CT_{1,i}\} = \{CT^{1}_{i}\} \cup \{CT^{*}_{i}\}$ and decrypts each $CT_{1,i}$ using $DEK_1$ directly to obtain $\bigcup \{B_{1,i}\} = \{B^{*}_{i}\} \cup \{B^{1}_{i}\}$. Owing to ECC, $u_1$ can recover $F_1$ from $\bigcup \{B_{1,i}\}$ with errors no more than $d/2$. As for $u_2$, after following the same steps to obtain the ciphertexts $\{CT_{2,i}\} = \{CT^{*}_{i}\} \cup \{CT^{2}_{i}\}$, it also receives the re-encrypted key $E(pk_2, DEK_1)$ from the CSP. $u_2$ can then obtain $DEK_1 = D(sk_2, E(pk_2, DEK_1))$ using its key pair $(pk_2, sk_2)$, decrypt each $CT^{*}_{i}$ to get the duplicated original blocks $\{B^{*}_{i}\}$, and obtain its unique original blocks $\{B^{2}_{i}\}$ by directly using $DEK_2$. Finally, it can obtain the original file $\bigcup \{B_{2,i}\} = \{B^{*}_{i}\} \cup \{B^{2}_{i}\}$ and recover $F_2$ using ECC.

Further Discussion
We recognize that the CSP is likely to increase its income by saving massive amounts of computation and storage through deduplication. In this case, it becomes essential for users to confirm that deduplication has actually happened at the CSP in order to obtain a lower storage charge, and our paper aims to solve this issue. To motivate the adoption of our scheme, in another line of our work, we study how to make all related stakeholders accept and use deduplication schemes by applying game theory to design proper incentive or punishment mechanisms in three cases: client-controlled deduplication [35], [36], server-controlled deduplication [12], and hybrid deduplication [13]. Since our scheme design is built upon the one in [14], which belongs to server-controlled deduplication, the incentive mechanism [12] suitable for server-controlled deduplication schemes can be applied to motivate scheme adoption. Moreover, linking a trust value to each CSP can help users choose a trustworthy CSP.

SECURITY ANALYSIS
In this section, we prove the correctness and the soundness of TDICP. Correctness means that the integrity check algorithm can correctly extract a queried column, and soundness means that the original file can be recovered if the corresponding TDICP integrity check passes. We also prove the soundness and privacy of UDDCP, and omit the proof of its correctness since it is obvious. For UDDCP, soundness means that the CSP cannot provide fake computation results during the whole procedure, privacy means that neither the CSP nor the data holder leaks any information to the other except for the intersection, and correctness means that the data holder can correctly pick out the whole intersection of its tag set and the CSP's tag set.

Correctness of TDICP
We first prove the correctness of TDICP in extracting the queried column where the notes are inserted based on the PIR algorithm. During the integrity check phase, the data holder computes as follows. When $e_i$ corresponds to the queried column, $e_i = N^{l} + a_l N^{r}$, we have $\sum_{i=1}^{x} e_i d_{ij} = \sum_{i=1}^{x} (N^{l} + a_l N^{r}) d_{ij}$, and then $\sum_{i=1}^{x} e_i d_{ij} \bmod N^{r} = \sum_{i=1}^{x} N^{l} d_{ij}$. Otherwise, $e_i = a_k N^{r}$, we have $\sum_{i=1}^{x} e_i d_{ij} = \sum_{i=1}^{x} (a_k N^{r}) d_{ij}$, and then $\sum_{i=1}^{x} e_i d_{ij} \bmod N^{r} = 0$. Assume that $i_r$ is the queried column; it holds that $\sum_{i=1}^{x} e_i d_{ij} \bmod N^{r} = \sum_{i=1}^{r} N^{l} d_{i_r j} = (d_{i_r j})_N$, i.e., a base-$N$ number whose digits are the queried elements. Above all, all the elements in the queried $i_r$th column are obtained.
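The extraction argument above can be checked numerically with a toy instance. The sketch below is not the authors' implementation: the names `build_query`, `respond`, `extract`, and `run_demo`, the small parameters, and the choice of $m$ (large enough that the unmasked sums never wrap around) are assumptions made for illustration. The matrix is indexed so that the challenge has $x$ entries and the response has $y$ entries, as in the performance analysis.

```python
import math
import random


def build_query(x, queried, N, m, b, rng, a_max=7):
    """Challenge vector with v_i = b * e_i mod m."""
    r = len(queried)
    v = []
    for i in range(x):
        a = rng.randint(1, a_max)            # random mask coefficient
        e = a * N**r                         # vanishes modulo N^r
        if i in queried:
            e += N ** queried.index(i)       # N^l marks the l-th queried index
        v.append((b * e) % m)
    return v


def respond(v, D):
    """CSP side: plain vector-matrix product, no secret is needed."""
    return [sum(v[i] * D[i][j] for i in range(len(D))) for j in range(len(D[0]))]


def extract(resp, r, N, m, b):
    """Unmask with b^{-1}, reduce modulo N^r, then read the base-N digits."""
    b_inv = pow(b, -1, m)
    rows = [[] for _ in range(r)]
    for value in resp:
        low = (value * b_inv % m) % N**r
        for l in range(r):
            rows[l].append(low // N**l % N)
    return rows


def run_demo(seed=0):
    rng = random.Random(seed)
    N, x, y, a_max = 2**16, 6, 4, 7          # toy sizes, elements are below N
    queried = [1, 4]
    r = len(queried)
    D = [[rng.randrange(N) for _ in range(y)] for _ in range(x)]
    m = x * (a_max + 1) * N ** (r + 1) + 1   # exceeds every unmasked sum
    b = rng.randrange(2, m)
    while math.gcd(b, m) != 1:
        b = rng.randrange(2, m)
    v = build_query(x, queried, N, m, b, rng, a_max)
    return D, queried, extract(respond(v, D), r, N, m, b)


if __name__ == "__main__":
    D, queried, rows = run_demo()
    print(rows == [D[i] for i in queried])
```

The masked terms $a N^{r}$ disappear after reduction modulo $N^{r}$, so only the queried entries remain as base-$N$ digits.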

Soundness of TDICP
Then, we further prove the soundness of TDICP by introducing the following game.
Assume there is an adversary $\mathcal{A}$ that corrupts on average an $r_{adv}$ fraction of the blocks of an outsourced file and succeeds in the soundness game of the proposed protocol with probability $d$. In the following proof, we show that if the number of queries $g$ exceeds a threshold $g_{neg}$, our protocol can recover the whole file with a probability of more than $1 - \frac{n}{2^{t}}$, where $t$ is the security parameter, when there exists an adversary $\mathcal{A}$ that can succeed in the soundness game with probability $d \ge d_{neg} = \frac{1}{2^{t}}$. Recall that $n$ is the length of the notes and $s$ is the number of notes in a note set. We first quantify $d$ with respect to the parameter $r_{adv}$. The adversary $\mathcal{A}$ can succeed in the soundness game under the following two conditions: 1) it does not corrupt any note; 2) it corrupts some of the notes, but can still provide valid notes that conform to the hidden function. Therefore, we define the probability that the adversary $\mathcal{A}$ succeeds in a single query with respect to $r_{adv}$ as $P_{\mathcal{A}}(Success, i) = (1 - r_{adv}) + \frac{r_{adv}}{2^{ns}}$. In TDICP, the integrity check requires the adversary $\mathcal{A}$ to respond with $g$ valid note sets to succeed in the soundness game, therefore $d = \left((1 - r_{adv}) + \frac{r_{adv}}{2^{ns}}\right)^{g}$. Note that if $ns$ is large enough, e.g., $ns = 128$, the term $\frac{r_{adv}}{2^{ns}}$ is negligible. We can thus simplify the above equation: if $ns \ge 128$, $d \approx (1 - r_{adv})^{g}$.
We then define a threshold $r_{neg}$ with respect to $r_{adv}$ such that if $r_{adv} < r_{neg}$, the probability that our protocol fails to recover the blocks is negligible.
Since TDICP adopts ECC and can recover $rD$ errors, we derive $r_{neg}$ as the bound of $r_{adv}$: $\left(1 - \frac{r}{r_{neg}}\right)^{2} r_{neg} = \frac{3\ln(2)\,t}{D}$ and $r_{neg} < r$. Next, we define a threshold $g_{neg}$ for the number of queries $g$ such that if an adversary $\mathcal{A}$ corrupts more than an $r_{neg}$ fraction of the blocks, it will be detected by our protocol with an overwhelming probability. In other words, if $g > g_{neg}$ and $r_{adv} > r_{neg}$, then the probability of the adversary $\mathcal{A}$ succeeding in the soundness game is negligible. According to the inequality $\ln x \le x - 1$, when $r_{adv} > r_{neg}$, $g_{neg} = \frac{\ln(2)\,t}{r_{neg}} \ge \frac{-\ln(2)\,t}{\ln(1 - r_{neg})} \ge \frac{-\ln(2)\,t}{\ln(1 - r_{adv})}$. Finally, we define the probability of a file being recovered. If there exists one block that fails to be recovered, the whole file fails to be recovered. Let $Q_{Fail}$ be the probability that the file fails to be recovered; then $Q_{Fail} \le \sum_{i=1}^{n} P(Fail, i)$. If we assume that the probability of a block failing to be recovered is negligible, i.e., $P(Fail, i) \le \frac{1}{2^{t}}$, the probability of the file being successfully recovered is $1 - Q_{Fail} \ge 1 - \frac{n}{2^{t}}$.
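As a numeric sanity check of the query threshold, the following sketch evaluates the adversary's success probability $d = ((1 - r_{adv}) + r_{adv}/2^{ns})^{g}$ at $g = \lceil g_{neg} \rceil$ with $g_{neg} = \ln(2)\,t / r_{neg}$. The function names and the sample values $r_{neg} = 0.05$ and $t = 40$ are hypothetical and chosen only for illustration.

```python
import math


def success_probability(r_adv: float, g: int, ns: int = 128) -> float:
    """Probability that g independent queries all return valid note sets."""
    per_query = (1 - r_adv) + r_adv / 2**ns
    return per_query ** g


def query_threshold(r_neg: float, t: int) -> float:
    """g_neg = ln(2) * t / r_neg."""
    return math.log(2) * t / r_neg


if __name__ == "__main__":
    r_neg, t = 0.05, 40                       # illustrative values only
    g = math.ceil(query_threshold(r_neg, t))
    for r_adv in (0.05, 0.06, 0.10):
        d = success_probability(r_adv, g)
        print(r_adv, g, d, d <= 2**-t)
```

For every $r_{adv} \ge r_{neg}$ tried here, the success probability stays below $2^{-t}$ once $g$ reaches the threshold, as the chain of inequalities predicts.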

Privacy of UDDCP
We further prove the privacy of UDDCP based on the irreversibility of the cuckoo filter. In UDDCP, the data holder is private, i.e., it leaks no information about its private inputs to the CSP. Since the data holder selects all values uniformly at random, i.e., $\{r_1, \ldots, r_{N_c}\} \subset Z_n^{*}$, $r_i^{inv}$ and $r'_i$ are all random sequences. The data holder masks its inputs $A[i]$ to the CSP with the random values $r'_i$, so that the CSP cannot obtain any $H(y_j)$ of the data holder except for the intersection. The CSP is also private, i.e., it leaks no information to the data holder, since we introduce a cuckoo filter to store the computation results $a_i$ in the filter generation phase. Due to the irreversibility of the filter, the data holder cannot obtain any $H(x_i)$ except for the intersection.
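The masking idea can be illustrated with a toy blind-RSA style private set intersection. This is a sketch under explicit assumptions, not the UDDCP specification: the message layout ($A[i] = H(y_i) \cdot r_i^{e}$, $C[i] = A[i]^{d}$) is inferred from the operation counts given in the performance analysis, a Python `set` stands in for the cuckoo filter, the modulus is built from two small known primes and is far too small for real use, and all function names are hypothetical.

```python
import hashlib
import math
import random

# Toy RSA parameters (two Mersenne primes); illustration only, not secure.
P, Q = 2**31 - 1, 2**61 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))


def H(tag: bytes) -> int:
    return int.from_bytes(hashlib.sha256(tag).digest(), "big") % N


def csp_filter(csp_tags):
    """Stand-in for the cuckoo filter: the set of a_j = H(x_j)^d mod N."""
    return {pow(H(x), D, N) for x in csp_tags}


def holder_mask(tags, rng):
    """Data holder: blind each hashed tag with a fresh random value."""
    inverses, challenge = [], []
    for y in tags:
        r = rng.randrange(2, N)
        while math.gcd(r, N) != 1:
            r = rng.randrange(2, N)
        inverses.append(pow(r, -1, N))
        challenge.append(H(y) * pow(r, E, N) % N)
    return inverses, challenge


def csp_respond(challenge):
    """CSP: exponentiate the blinded values without learning H(y_i)."""
    return [pow(a, D, N) for a in challenge]


def holder_check(tags, inverses, challenge, response, filt):
    """Verify the CSP computation, unblind, and test filter membership."""
    duplicates = []
    for y, r_inv, a, c in zip(tags, inverses, challenge, response):
        if pow(c, E, N) != a:
            raise ValueError("CSP returned an incorrect computation")
        if c * r_inv % N in filt:
            duplicates.append(y)
    return duplicates


if __name__ == "__main__":
    rng = random.Random(1)
    filt = csp_filter([b"t1", b"t2", b"t3"])
    tags = [b"t2", b"t9"]
    inv, ch = holder_mask(tags, rng)
    print(holder_check(tags, inv, ch, csp_respond(ch), filt))
```

In this sketch the CSP only ever sees blinded values, and the data holder learns only whether each of its own tags is present in the filter.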

Soundness of UDDCP
We prove the soundness of UDDCP by illustrating how it handles all potential cheats that the CSP can perform: 1) the CSP may provide unauthorized tags that are not from previous data holders, or delete some stored tags, driven by profit; 2) the CSP may provide wrong computation results of $a_j$ or $C[i]$ to the AA or the data holder.
In UDDCP, the first cheat can be detected, since we employ the AA to verify all the signatures and record the size of the CSP's tag set. Unauthorized tags created by the CSP are easily found, and the CSP is audited to provide all the tags from previous data owners. The second cheat can also be detected, since we let the AA verify whether $\prod H(x_j) = (\prod a_j)^{e}$ holds, which can be proved correct according to the multiplicative homomorphism of RSA. Wrong computations of any $a_j$ or $C[i]$ can thus be detected by the AA.
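The batch check $\prod H(x_j) = (\prod a_j)^{e}$ relies on the multiplicative homomorphism of RSA and can be sketched as follows. The toy modulus, the function names, and the way subsets are formed (a random shuffle cut into consecutive chunks) are assumptions for illustration; the products are taken modulo $N$.

```python
import hashlib
import random
from math import prod

# Toy RSA parameters (two Mersenne primes); illustration only, not secure.
P, Q = 2**31 - 1, 2**61 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))


def H(tag: bytes) -> int:
    return int.from_bytes(hashlib.sha256(tag).digest(), "big") % N


def batch_verify(hashes, results, subset_size, rng):
    """AA side: one exponentiation per non-overlapping random subset."""
    order = list(range(len(hashes)))
    rng.shuffle(order)
    for start in range(0, len(order), subset_size):
        subset = order[start:start + subset_size]
        h = prod(hashes[j] for j in subset) % N
        a = prod(results[j] for j in subset) % N
        if pow(a, E, N) != h:        # (prod a_j)^e must equal prod H(x_j)
            return False
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    hashes = [H(bytes([j])) for j in range(10)]
    results = [pow(h, D, N) for h in hashes]      # honest CSP: a_j = H(x_j)^d
    print(batch_verify(hashes, results, 4, rng))
    results[3] = (results[3] + 1) % N             # one wrong computation
    print(batch_verify(hashes, results, 4, rng))
```

Larger subsets mean fewer exponentiations for the AA, at the price of more multiplications per subset.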

PERFORMANCE ANALYSIS AND EVALUATION
In this section, we perform theoretical analysis, conduct simulation-based evaluation on VeriDedup, and compare its performance with related previous works.

Evaluation Metrics
We applied five metrics in our simulation studies to evaluate TDICP: (1) the data owner's computational complexity for creating and inserting the note set; (2) the data holder's storage overhead for extra data storage in integrity check; (3) the data holder's computational complexity for challenging the CSP and retrieving the inserted note set for verification; (4) the CSP's computational complexity for responding to the challenge from the data holder; (5) the data holder-CSP communication cost for transferring extra data in integrity check. The communication cost of AA for broadcasting the hidden function $f$ is omitted, since it is a one-time cost that is independent of the integrity check interactions.
Meanwhile, we used six metrics in our simulation studies to evaluate UDDCP: (1) the data holder's computational complexity for initializing duplication check; (2) the CSP's computational complexity for preprocessing its tag set and responding to the challenge from data holders; (3) AA's computational complexity for verifying the CSP computation and setting up the cuckoo filter; (4) the data holder's computational complexity for confirming duplicate blocks; (5) the communication cost from the CSP to AA for constructing the cuckoo filter; (6) the communication cost between the data holder and the CSP for transferring extra data in duplication check. The communication cost from AA to the data holder for transferring the cuckoo filter is omitted, since it depends on the concrete type of the cuckoo filter.
Several previous schemes [7], [15], [16], [18], [20] address integrity check in various scenarios. We found that StealthGuard [20] is the only one that targets the same issue as ours, namely integrity check over deduplicated data. Thus, we chose StealthGuard as the baseline scheme and compared its simulation performance with ours in terms of integrity check. Meanwhile, to the best of our knowledge, our scheme is the first to consider proving the correctness of duplication check. Thus, in what follows, we provide the evaluation results of UDDCP without comparison with other previous work.

Experimental Settings
We implemented our scheme in Python and tested it on a desktop equipped with an i5 CPU, 8 GB RAM, and a 64-bit Win10 OS. We chose SHA-256 as the cryptographic hash function and 1024-bit RSA for digital signatures. We employed the Crypto library to realize Advanced Encryption Standard (AES) encryption. We applied a MySQL database to store data and their related information and built secure channels between entities using SSLsocket. We employed a cuckoo filter with 12.6 bits per item on average, which offers a false positive rate of 0.19%. Our tests focus on the performance of the proposed TDICP and UDDCP; details of the rest, including encryption, decryption, and key re-encryption, can be found in our previous paper.
Assume there exist $n'$ elements after the block has been transformed into a matrix. Let a note set that conforms to the hidden function contain $r$ notes. If $k$ note sets are required to be inserted, there exist $s = k \cdot r$ notes in total.
Assume that the data holder inserts $c$ notes in each column on average; then $s/(tc)$ queries are needed to fetch all the notes. Therefore, the communication cost between the data holder and the CSP for checking the integrity is $(x + y) \cdot |m| \cdot s/(tc)$. For a fixed number of elements $n = x \cdot y$, this cost is minimized when $x = y$. Thus, in our experiment, we set $x = y = \sqrt{n}$ in order to minimize the communication cost. Meanwhile, the number of multiplications in the integrity check is $2 \cdot x \cdot s/(tc) + x \cdot y \cdot s/(tc) + s = 2 \cdot x \cdot s/(tc) + n \cdot s/(tc) + s + y$. Therefore, the larger $t \cdot c$ is, the fewer multiplications are needed. Thus, we set $c = y$ and $t = \lceil s/c \rceil$ to minimize the number of multiplications needed for integrity check.

Performance Analysis
In this subsection, we analyze the performance of our two proposed protocols. Hereafter, computation costs are represented with exp for exponentiation, mul for multiplication, PRF for pseudo-random function, PRP for pseudo-random permutation, INV for inversion, and enc for encryption.

Performance Analysis on TDICP
We first analyze the integrity check performance theoretically. The proposed TDICP involves two types of system entities: data holder/data owner and CSP. In order to be compatible with the chunk setting of previous deduplication schemes [37] and to compare computational complexity and communication cost with previous PoR schemes [7], [15], [16], [18], [20], we follow the settings of the prior arts and split a 4 GB file into blocks of 128 KB, so that each block contains 16384 elements of 64 bits each. In order to reduce the probability that the CSP finds collisions that help it pass the integrity verification, we selected the hidden function as $notes[1] \,\|\, notes[2] = (Hash_{SHA256}(notes[3] \,\|\, notes[4]))_{128}$, which is proved secure and efficient in [38], and inserted 8 pairs of notes as verification tags into each block. The size of each note is 64 bits. We adopted ECC in all splits and required it to correct 5% errors (912 elements). Thus, each block contains $18240 = 16384 + 32 + 912 \times 2$ elements. We adopted AES for symmetric encryption and the PRE scheme proposed in [39]. We analyzed the computational complexity, storage overhead, and communication cost of each entity at various phases as below. The results are shown in Table 3.
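A note set that satisfies this hidden function can be sketched as follows. The sketch makes two assumptions that the text does not fix: the subscript 128 is read as the leading 128 bits of the SHA-256 digest, and `os.urandom` stands in for the PRF that derives $notes[3]$ and $notes[4]$. The function names are hypothetical.

```python
import hashlib
import os

NOTE_BYTES = 8  # each note is 64 bits


def make_note_set():
    """Return [notes1, notes2, notes3, notes4] conforming to the hidden function."""
    n3, n4 = os.urandom(NOTE_BYTES), os.urandom(NOTE_BYTES)
    head = hashlib.sha256(n3 + n4).digest()[:16]   # assumed: leading 128 bits
    return [head[:NOTE_BYTES], head[NOTE_BYTES:], n3, n4]


def conforms(notes) -> bool:
    """Check notes1 || notes2 == leading 128 bits of SHA-256(notes3 || notes4)."""
    n1, n2, n3, n4 = notes
    return n1 + n2 == hashlib.sha256(n3 + n4).digest()[:16]


if __name__ == "__main__":
    note_set = make_note_set()
    print(conforms(note_set))
    corrupted = [bytes(v ^ 0xFF for v in note_set[0])] + note_set[1:]
    print(conforms(corrupted))
```

Any holder who knows the positions can run `conforms` on the retrieved notes without knowing which values the first uploader actually inserted.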
Computational complexity of data owner at setup phase. As the first data uploader, the data owner sets up the whole integrity check by inserting the initial notes into each block. Assume that the data owner would like to insert $k$ pairs of notes into one block. It performs $4k$ pseudo-random permutation (PRP) operations to decide the positions $P_i$, and $2k$ pseudo-random function (PRF) operations and $k$ HASH operations to derive all the notes. In our test, a 4 GB file contains 32768 blocks, and we set $k = 8$ for comparison with previous work; TDICP therefore requires the data owner to perform 1048576 PRP, 524288 PRF, and 262144 HASH operations. Compared with the most related work, StealthGuard, we bring more costs to the data owner, since we introduce a hidden function instead of the watchdog in StealthGuard. However, the setup phase is a one-time cost that is independent of the integrity check, and the extra cost is reasonable because our protocol provides a new feature, the deduplication of verification tags, which is a step beyond StealthGuard.
Storage overhead of data holders. In order to fulfill the task of integrity check, all the data holders need to record the position set $P_i$, so that they can later retrieve the notes based on the PIR algorithm. In our protocol, the size of each position is 15 bits. Thus, the data holder needs $4k \times 15$ bits of additional storage per block to record these positions. For a 4 GB file, this costs an additional $32768 \times 4 \times 8 \times 15$ bits $= 1.875$ MB. Compared with StealthGuard and other previous work, our scheme saves storage to varying degrees thanks to the novel verification tags.
Computational complexity of CSP at response phase. As the entity that responds to the integrity challenge from the data holders, the CSP performs $1 \times x \times y$ multiplications (mul) to compute $Resp = v_i \times D$. Therefore, assuming that $x = 135$ and $y = 136$ for the matrix $D$, the CSP performs 18360 mul to compute a response to the data holder, and in total $1719 \times 18360 = 31560840$ mul for all 1719 notes. Compared with StealthGuard, our scheme reduces the computational complexity by a factor of almost 20, mainly because in StealthGuard the CSP needs to transform the matrix $D$ into a bit matrix, which increases the number of computations.
Computational complexity of data holder at challenge and response phase. In each challenge phase, the data holder performs $x$ mul and $x$ PRF to generate the private coefficients $e_i$ and $x$ mul to compute $v_i = b\,e_i \bmod m$. Assuming that $x = 135$, the data holder performs 270 mul and 135 PRF to generate one challenge. In total, the data holder performs 464130 mul and 232065 PRF to generate all challenges for a 4 GB file. In each verification phase, there exist a best case and a worst case. In the best case, the data holder performs $4 + y$ mul to extract a queried column based on the PIR algorithm. In the worst case, the data holder performs $(4 + 3 + 2 + 1) + y$ mul to extract the queried column. Therefore, in the best case, the data holder in TDICP performs 140 mul and 1 HASH to verify that the note set conforms to $f$. In the worst case, the data holder performs 146 mul and 1 HASH to verify the note set. In total, the data holder performs 250974 (240660) mul and 1719 HASH to verify a 4 GB file. Compared with StealthGuard, our scheme is more efficient at the challenge phase and slightly worse during verification. The reason is that TDICP performs fewer computations to retrieve the tags, but more computations to verify whether the notes conform to the hidden function. In fact, for the same file containing one pair of notes or a single watchdog, our method of note insertion only affects the complexity of the verification phase, and the effect is very small according to the above analysis.
Communication cost of challenge and response phase. When the data holder challenges the integrity of a block, it sends a $[1, x]$ vector to the CSP. The size of each element in the vector is 334 bits. Thus, the size of each challenge is $334 \times x$ bits. In total, the size of each challenge is $334 \text{ bits} \times 135 = 5636.25$ bytes, which amounts to 9.24 MB for a 4 GB file. In the response phase, the CSP sends a $[1, y]$ vector back to the data holder, and the size of each element in the vector is also 334 bits. Thus, the size of each response is $334 \times y$ bits. Recalling that $y = 136$, the size of each response is $334 \text{ bits} \times 136 = 5678$ bytes, which amounts to 9.31 MB for a 4 GB file. Compared with StealthGuard, our communication cost is smaller, since our scheme reduces the data that needs to be retrieved based on the PIR algorithm.
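The figures quoted in this analysis can be reproduced with a few lines of arithmetic. The sketch below only recomputes numbers from parameters stated in the text (4 GB file, 128 KB blocks, $k = 8$, $x = 135$, $y = 136$, 1719 challenges per file, 334-bit elements, 15-bit positions); the function name and dictionary keys are hypothetical.

```python
FILE_BYTES = 4 * 2**30
BLOCK_BYTES = 128 * 2**10
K = 8                    # pairs of notes per block
X, Y = 135, 136          # matrix dimensions
CHALLENGES = 1719        # challenges per 4 GB file, as stated in the text
ELEM_BITS = 334
POSITION_BITS = 15
MB = 2**20


def tdicp_costs():
    blocks = FILE_BYTES // BLOCK_BYTES
    challenge_bytes = ELEM_BITS * X / 8
    response_bytes = ELEM_BITS * Y / 8
    return {
        "blocks": blocks,
        "setup_prp": 4 * K * blocks,
        "setup_prf": 2 * K * blocks,
        "setup_hash": K * blocks,
        "storage_mb": blocks * 4 * K * POSITION_BITS / 8 / MB,
        "csp_mul": X * Y * CHALLENGES,
        "challenge_mul": 2 * X * CHALLENGES,
        "challenge_prf": X * CHALLENGES,
        "verify_mul_best": (4 + Y) * CHALLENGES,
        "verify_mul_worst": (4 + 3 + 2 + 1 + Y) * CHALLENGES,
        "challenge_bytes": challenge_bytes,
        "response_bytes": response_bytes,
        "challenge_total_mb": round(challenge_bytes * CHALLENGES / MB, 2),
        "response_total_mb": round(response_bytes * CHALLENGES / MB, 2),
    }


if __name__ == "__main__":
    for name, value in tdicp_costs().items():
        print(f"{name}: {value}")
```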

Performance Analysis on UDDCP
We then theoretically analyze the performance of duplication check. The proposed protocol involves three types of system entities: data holder, AA, and CSP. Assume that the data holder holds $N_c$ tags and the CSP already maintains $N_s$ tags. We adopt RSA for signatures and analyze the computational complexity and communication cost of each entity at various phases as below. The results with respect to computational complexity are shown in Table 4.
Computational complexity of data holder at preprocessing phase: The data holder that wants to check whether its uploaded blocks are duplicates needs to perform $N_c$ PRF, $N_c$ INV, and $N_c$ exp to initialize the PSI algorithm, all of which are proportional to the number of tags held by the data holder.
Computational complexity of CSP at filter generation and duplication check phase: As we can see in the protocol, the CSP performs modular exponentiations at the setup and online phases, whose computational complexities are $N_s$ exp and $N_c$ exp, respectively.
Computational complexity of AA at filter generation phase: During the setup phase, AA performs three types of verification: 1) Tag number verification, i.e., $N \stackrel{?}{=} N_s$, whose cost can be omitted; 2) Signature verification: AA performs $N_s$ exp to verify all signatures provided by the CSP; 3) CSP computation verification: Using batch verification, AA performs $2 \cdot N_s$ multiplications and a number of exponentiations equal to $N_s$ divided by the size of each non-overlapping subset used to verify the CSP computations according to the corresponding tag values. Also, AA needs to construct a cuckoo filter with $N_s$ elements. Note that once the system is set up, when several new tags that are not yet maintained by the CSP arrive, AA only needs to verify the newly arrived tags instead of all the tags maintained by the CSP. This implies that the total computational complexity of AA is proportional to $N'_c$ instead of $N_s$.
Computational complexity of data holder at duplication check phase: At the duplication check phase, the data holder first performs $N_c$ mul to create a challenge to the CSP. In order to verify the response sent back from the CSP, the data holder then conducts $N_c$ exp computations to verify the CSP computation. Finally, $N_c$ mul operations are needed for the data holder to check duplication with the help of the cuckoo filter provided by the AA.
Communication cost from CSP to AA: During the setup phase, the CSP sends all its maintained tag values, the corresponding signatures, and the computation results $\{a_i\}$ to the AA, whose element sizes are 256 bits, 576 bits, and 1024 bits, respectively. In total, the CSP is required to transfer $(256 + 576 + 1024) \text{ bits} \times N_s = 232 \times N_s$ bytes to the AA for constructing the cuckoo filter, which is linear in the number of tags maintained at the CSP.
Communication cost between data holder and CSP: At the duplication check phase, the data holder sends $\{A[i]\}$ to the CSP. Since we set the modulus $N$ as a 1024-bit integer, the size of each $A[i]$ is 1024 bits. In total, the size of each challenge is $1024 \times N_c$ bits. The CSP responds to the challenge with $\{C[i]\}$, whose element size is 1024 bits. In total, the size of each response is $1024 \times N_c$ bits. Thus, the total communication cost between the data holder and the CSP is $(1024 + 1024) \text{ bits} \times N_c = 256 \times N_c$ bytes, which is proportional to the number of tags of the data holder.
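The two communication formulas above reduce to simple byte counts. The helper below is a hypothetical convenience function that recomputes them from the element sizes stated in the text; the sample tag counts in the usage lines are illustrative only.

```python
def uddcp_comm_bytes(n_s: int, n_c: int,
                     tag_bits: int = 256,
                     sig_bits: int = 576,
                     elem_bits: int = 1024):
    """Return (CSP-to-AA bytes, data holder/CSP bytes) for the given tag counts."""
    csp_to_aa = (tag_bits + sig_bits + elem_bits) * n_s // 8
    holder_csp = (elem_bits + elem_bits) * n_c // 8   # challenge plus response
    return csp_to_aa, holder_csp


if __name__ == "__main__":
    print(uddcp_comm_bytes(1, 1))          # per-tag costs in bytes
    print(uddcp_comm_bytes(10000, 100))    # illustrative tag counts
```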

Performance Evaluation
In this subsection, we present simulation based evaluation results of the two proposed protocols.

Performance Evaluation on TDICP
We first present the performance evaluation results of TDICP and compare it with StealthGuard in terms of setup cost and integrity check cost at the CSP and the data holder (DH), respectively. Since we propose the novel note set as the verification tag, the note ratio, i.e., the ratio of the size of the inserted notes to that of the block, is a unique evaluation parameter of TDICP, and we evaluate it without comparison.
Impact of note ratio: Figs. 5a, 5b, and 5c show the note insertion cost, integrity check cost, and note removing cost of our scheme with the note ratio varying from 0.02 to 0.10 and note sizes of 32 KB, 64 KB, and 128 KB, respectively. As we can see, the larger the note size is, the higher the note insertion cost, integrity check cost, and note removing cost are, which matches our expectation. When the note ratio increases, all these costs increase linearly, since our meta verification block is a note set that contains 4 notes conforming to the hidden function. The increase of the note ratio increases the operation time for inserting, verifying, and removing those similar verification blocks.
Impact of tag size: Figs. 6a, 6b, 6c, 6d, and 6e show the setup cost, the data holder storage overhead, the CSP integrity check cost, the data holder integrity check cost, and the total integrity check cost of TDICP with regard to the size of notes (watchdogs) varying from 2 KB to 14 KB, compared with StealthGuard. Fig. 6a compares the setup cost of our scheme with that of StealthGuard. The setup cost increases as the tag size increases in both schemes, as expected. As we can see, TDICP incurs a higher computation cost than StealthGuard at the setup phase. The reason is that TDICP needs to perform additional HASH operations and permutations compared with StealthGuard. Fig. 6b compares the storage overhead of TDICP with that of StealthGuard at the data holder. StealthGuard incurs higher storage overhead, since it requires the data holder to record all the watchdogs, whereas TDICP requires the data holder to store only the position indexes $P$ of the notes, whose size is smaller than that of the watchdogs. Fig. 6c compares the CSP cost of TDICP with that of StealthGuard. StealthGuard incurs a higher computation cost, since it requires the CSP to transform the data into an 80-bit matrix, which increases the number of multiplications executed at the CSP. Fig. 6d compares the data holder cost of TDICP with that of StealthGuard. We can see that StealthGuard incurs a higher computation cost, since it requires the data holder to perform more computations to extract the verification tags from the response. Finally, Fig. 6e summarizes the total cost of checking the integrity of a 128 KB file and compares it with StealthGuard. We can see that TDICP outperforms StealthGuard with respect to the computation cost at both the CSP and the data holder side, and all of those costs increase as the size of notes (watchdogs) increases.

Performance Evaluation on UDDCP
We further tested the performance of UDDCP. We assume that the CSP already maintains a larger number of tags than the data holder. Since the computational complexity of signature verification is obviously linear in the number of tags, our simulation focuses on the verification of the CSP computations by the AA and by the data holder, respectively. We also evaluated UDDCP performance with various sizes of the non-overlapping subset, as we introduce batch verification into UDDCP. Fig. 7a presents the verification cost of AA on CSP computations. As we can see, for all subset sizes, the verification cost increases linearly with the number of CSP tags, as expected. Meanwhile, the larger each non-overlapping subset is, the lower the verification cost, since the number of exponentiations needed for verification decreases. Fig. 7b presents the verification cost of the data holder on CSP computations. Similar to the verification at AA, the verification cost increases linearly with the number of data holder tags, as expected. Also, the larger each non-overlapping subset is, the lower the verification cost. Fig. 7c presents the communication cost between the CSP and AA. We can see that the communication cost increases linearly as the number of elements in the CSP tag set $X$ increases, as expected. The reason is that the CSP is required to provide all its maintained tags to the AA for computation and signature auditing. Fig. 7d presents the communication cost between the CSP and the data holder. As the number of elements in the data holder tag set $Y$ increases, the communication cost increases linearly, as expected. The reason is that the data holder sends all its masked tag values to the CSP as challenges and the CSP then responds to all the challenges, so the cost is linear in the number of tags.

CONCLUSION
In this paper, we introduced VeriDedup to check the integrity of an outsourced encrypted file and guarantee the correctness of duplication check in an integrated way. The integrity check protocol TDICP of VeriDedup allows multiple data holders to verify the integrity of their outsourced file with their own individual verification tags without interacting with the data owner. In addition, we employed a novel challenge and response mechanism in the duplication check protocol UDDCP of VeriDedup to let the data holder, instead of the CSP, first tell whether a file is a duplicate, in order to guarantee the correctness of duplication check. Security and performance analysis shows that VeriDedup is secure and efficient under the described security model. The results of our computer simulation further show its efficiency compared with highly related prior arts.
Hui Bai received the BEng degree in information security from Xidian University, Xi'an, China, in 2019. She is currently working toward the master's degree in cyberspace security at Xidian University, Xi'an, China. Her research interests include verifiable computation and machine learning.
Zheng Yan (Senior Member, IEEE) received the DSc degree in technology from the Helsinki University of Technology, Espoo, Finland, in 2007. She is currently a professor with the School of Cyber Engineering, Xidian University, Xi'an, China and a visiting professor and Finnish Academy research fellow with the Aalto University, Helsinki, Finland. Her research interests include trust, security, privacy, and security-related data analytics. She is an area editor or an associate editor of the IEEE Internet of Things Journal, Information Fusion, Information Sciences, IEEE Access, and Journal of Network and Computer Applications. She served as a general chair or program chair for numerous international conferences, including IEEE TrustCom 2015 and IFIP Networking 2021. She is a Founding Steering Committee cochair of IEEE Blockchain conference. She received several awards in recent years, including the Distinguished Inventor Award of Nokia, Aalto ELEC Impact Award, the Best Journal Paper Award issued by IEEE Communication Society Technical Committee on Big Data and the Outstanding Associate Editor of 2017 and 2018 for IEEE Access, etc.
Rui Zhang (Member, IEEE) received the BE degree in communication engineering and the ME degree in communication and information system from the Huazhong University of Science and Technology, Wuhan, China, in 2001 and 2005, respectively, and the PhD degree in electrical engineering from the Arizona State University, Tempe, Arizona, in 2013. He has been an assistant professor with the Department of Computer and Information Sciences Department, University of Delaware since 2016. Prior to joining UD, he had been an assistant professor with the Department of Electrical Engineering, University of Hawaii from 2013 to 2016. His research interests include security and privacy issues in wireless networks, mobile crowdsourcing, mobile systems for disabled people, cloud computing, and social networks. He received the U.S. NSF CAREER Award in 2017.