A Data Deduplication Scheme Based on DBSCAN With Tolerable Clustering Deviation

To protect data privacy, users prefer to store encrypted data in cloud servers. Cloud servers reduce the cost of storage and network bandwidth by eliminating duplicate copies. To address the potential internal data leakage problem, the concept of clustering deviation is proposed for the first time. We improve the DBSCAN algorithm to tolerate clustering deviation. A data deduplication scheme is built upon the new algorithm, which considers users as clustering samples. Instead of immediately re-clustering new users, a certain deviation is tolerated to assign the users to the existing classes. We determine the popularity of the data according to user clustering results and apply different encryption schemes to protect the security of unpopular data more effectively. The performance of the algorithm is analyzed and compared with other methods through experiments, and the results verify the feasibility and efficiency of the proposed deduplication scheme.


I. INTRODUCTION
The development and application of cloud computing have led more and more users to store their data on the cloud server (CS) [1], [2], [3]. To save bandwidth and storage space, servers usually use data deduplication techniques, i.e., they maintain only a single copy of data and remove redundancy [4].
However, users want to encrypt their data before uploading it to the CS, both to protect their privacy and to prevent the CS or other attackers from obtaining the data content [5], [6]. With traditional encryption schemes, users select keys at random and encrypt the plaintext, so the ciphertexts stored on the CS differ even for the same plaintext, which makes deduplication very difficult. Conversely, having all users encrypt the data with the same key significantly reduces the security of the system. Convergent encryption (CE) was proposed to solve this problem [7]. In CE, the key is derived from the plaintext, so the same plaintext produces the same key, which in turn produces the same ciphertext. This allows deduplication of encrypted data. However, because the key derivation process is deterministic, CE is vulnerable to offline brute-force attacks [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Amjad Mehmood.
In recent years, many researchers have worked on designing Message-Locked Encryption (MLE)-based deduplication schemes [9], [10]. In response to the above attacks, Stanek et al. proposed a deduplication scheme based on popularity division [11]. Data with different popularity are encrypted using different encryption methods to further save cloud storage space and network bandwidth. Puzio et al. proposed the ClouDedup scheme, with a metadata manager and an additional server defined in the CS: the server adds an encryption layer to prevent attacks against CE and thus protects the confidentiality of data [12]. DupLESS uses a key manager to generate the key and applies an oblivious pseudorandom function (OPRF) to achieve a high level of security [13]. The scheme of Zhang et al. uses an elliptic curve encryption algorithm to achieve data confidentiality, and applies different encryption methods to popular and unpopular data to reduce the computational overhead [14]. Liu et al. proposed a secure data deduplication scheme that does not require a third-party server [15]. This scheme adopts password-authenticated key exchange (PAKE) to pass keys between users, thus achieving cross-user data deduplication. It also eliminates the dependence on third-party servers and improves security. However, it requires all users involved in the protocol to be online when exchanging keys, which significantly increases the communication overhead and reduces the practicality.
Existing data deduplication schemes focus on the protection and delivery of encryption keys and the identification of duplicate data, while ignoring the impact of users on deduplication. Many schemes differentiate data according to popularity: for data with few holders, a semantically secure encryption scheme with higher security is used. When the number of data holders increases, the system considers the data less sensitive and uses a less secure encryption scheme such as CE. However, the situation is different if the data belong to users from the same organization, such as a company's internal address book. In that case, an increase in the number of data holders does not mean that the sensitivity of the data decreases. If the system then switches to a less secure encryption scheme, it creates the potential for internal data leakage.

A. CONTRIBUTION
To address the above problems, we consider the effect of user attributes on popularity recognition and propose a data deduplication scheme based on a DBSCAN algorithm with tolerable clustering deviation (TCD-DBSCAN). The contributions of this paper are as follows.
1) We propose the TCD-DBSCAN algorithm and use it for user clustering in the data deduplication scheme to reduce the risk of internal data leakage.
2) Dynamic counting is performed based on the clustering results, and the popularity of the data is then determined by whether the "number" of data holders reaches the popularity threshold, so as to avoid the premature conversion of internal private data to popular data.
3) A bilinear mapping is adopted to construct data tags for identifying the original plaintext. For unpopular data, an RSA-based blind signature algorithm is used between the user and the key manager.

B. ORGANIZATION
The rest of this paper is organized as follows. In Section II, the required preliminaries for this paper are presented.
In Section III, we describe the system model and definitions. In Section IV, we propose the construction of the scheme. Section V presents the security analysis. We describe the simulation and experimental analysis of the scheme in Section VI. In Section VII, we summarize the paper.

II. PRELIMINARIES
In this section, we briefly introduce the RSA-based blind signature algorithm, bilinear mapping, the DBSCAN algorithm and attribute similarity calculation.

A. RSA-BASED BLIND SIGNATURE ALGORITHM
This scheme adopts an RSA-based blind signature algorithm to generate the encryption key through an interaction between the key manager and the user without any information disclosure [16]. The key manager holds the private key $d$ and publishes the public key $e$, where $e \cdot d \equiv 1 \bmod \phi(n)$. The user chooses a random value $r \in \mathbb{Z}_n^*$ and computes $x \equiv h \cdot r^e \bmod n$, where $h$ is the hash value of the plaintext $M$. The user then sends $x$ to the key manager. On receiving $x$, the key manager calculates $y \equiv x^d \bmod n$ and returns it to the user. Since $y \equiv h^d \cdot r \bmod n$, the user removes the blinding value $r$ and obtains the secret value $z \equiv y \cdot r^{-1} \bmod n$. Finally, the user derives the encryption key $K \leftarrow H_2(z)$ and uses it to encrypt the data.
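The blinding, signing and unblinding steps above can be sketched in a few lines. This is only a toy illustration, not the implementation used in the scheme: the modulus is built from two small Mersenne primes, SHA-256 stands in for both the plaintext hash and $H_2$, and all function names are hypothetical.

```python
import hashlib
import math
import secrets

# Toy K-man key (assumption: two Mersenne primes, far too small for real use).
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent: e * d = 1 mod phi(n)


def hash_to_int(message: bytes) -> int:
    """h: hash of the plaintext M reduced mod n (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def blind(h: int):
    """User side: pick r in Z_n^* and compute x = h * r^e mod n."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return (h * pow(r, E, N)) % N, r


def sign_blinded(x: int) -> int:
    """K-man side: y = x^d mod n, computed without learning h."""
    return pow(x, D, N)


def unblind(y: int, r: int) -> int:
    """User side: z = y * r^(-1) mod n, which equals h^d mod n."""
    return (y * pow(r, -1, N)) % N


def verify(h: int, z: int) -> bool:
    """Check that z is a valid signature on h: z^e mod n == h."""
    return pow(z, E, N) == h


def derive_key(z: int) -> bytes:
    """K <- H_2(z); H_2 is modelled with SHA-256 here (assumption)."""
    width = (N.bit_length() + 7) // 8
    return hashlib.sha256(z.to_bytes(width, "big")).digest()


def request_key(message: bytes) -> bytes:
    """Full user-side flow: blind, obtain signature, unblind, verify, derive K."""
    h = hash_to_int(message)
    x, r = blind(h)
    z = unblind(sign_blinded(x), r)
    if not verify(h, z):
        raise ValueError("blind signature did not verify")
    return derive_key(z)
```

Because $z$ depends only on $h$ and $d$, two users holding the same plaintext obtain the same key even though they pick different blinding values $r$, which is what makes deduplication of the resulting ciphertexts possible.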

B. BILINEAR MAPPING
Let $G$ and $G_T$ be two multiplicative cyclic groups of order $q$, where $q$ is a large prime and $g$ is a generator of $G$. Define the mapping $e: G \times G \rightarrow G_T$, which satisfies the following properties [17]:
1) Computability: for all $g_1, g_2 \in G$, there is an efficient algorithm to calculate $e(g_1, g_2)$.
2) Bilinearity: for all $g_1, g_2 \in G$ and $a, b \in \mathbb{Z}_q$, $e(g_1^a, g_2^b) = e(g_1, g_2)^{ab}$.
3) Non-degeneracy: there exist $g_1, g_2 \in G$ such that $e(g_1, g_2) \neq 1$.

C. DBSCAN ALGORITHM
Density-based spatial clustering of applications with noise (DBSCAN) is an unsupervised machine learning clustering algorithm [18]. There are two important parameters in the DBSCAN algorithm: Eps ($\epsilon$) and MinPts. The former is the neighborhood radius used to define density, and the latter is the threshold used to define a core point [19].
DBSCAN classifies sample points into three classes:
1) Core point: if at least MinPts samples exist in the $\epsilon$-neighborhood of sample $p$, i.e., $|N_\epsilon(p)| \geq MinPts$, sample $p$ is a core point.
2) Border point: if the number of samples in the $\epsilon$-neighborhood of sample $p$ is less than MinPts, but $p$ lies in the $\epsilon$-neighborhood of a core point, then sample $p$ is a border point.
3) Noise point: a point that is neither a core point nor a border point.
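The three definitions can be illustrated with a short sketch. It assumes Euclidean distance and counts a point as a member of its own $\epsilon$-neighborhood (a common convention, but an assumption here); the function name is hypothetical.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise' (DBSCAN definitions)."""
    # eps-neighborhood of each point, including the point itself (assumption).
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps] for p in points
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]

    kinds = []
    for i, nb in enumerate(neighbours):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nb):
            # Not dense enough itself, but within reach of a core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
    print(classify_points(pts, eps=1.0, min_pts=3))
```

In this example the four corner points of the unit square are core points, (2, 1) is a border point reachable from (1, 1), and (5, 5) is noise.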

D. ATTRIBUTE SIMILARITY CALCULATION
When calculating attribute distances, attributes are classified as "ordinal attributes" or "non-ordinal attributes" depending on whether an "ordered" relationship is defined on them. For example, for an attribute such as a user's IP address, the distance can be determined directly from the attribute values, so it is referred to as an "ordinal attribute"; for an attribute such as a user's domain name, the distance cannot be calculated directly from the attribute values, so it is referred to as a "non-ordinal attribute". Mixed attributes are handled by combining the Minkowski distance and the value difference metric (VDM). Suppose users $x_i = \{x_{i1}, \ldots, x_{in}\}$ and $x_j = \{x_{j1}, \ldots, x_{jn}\}$ have $n_c$ ordinal attributes and $n - n_c$ non-ordinal attributes; then the distance between them is calculated as shown in Eq. (1).
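As a rough illustration of how the two metrics can be combined, the sketch below uses a common textbook form: the $p$-th powers of the absolute differences of the ordinal attributes are added to the VDM terms of the non-ordinal attributes, and the $p$-th root is taken. This form, the use of an existing cluster labelling to build the VDM statistics, and all names are assumptions and may differ in detail from Eq. (1).

```python
from collections import Counter, defaultdict


def build_vdm_counts(samples, labels, nominal_idx):
    """counts[u][value][cluster] = number of samples with that value in that cluster."""
    counts = {u: defaultdict(Counter) for u in nominal_idx}
    for x, lab in zip(samples, labels):
        for u in nominal_idx:
            counts[u][x[u]][lab] += 1
    return counts


def vdm(counts_u, a, b, clusters, p):
    """Value difference metric between two values a, b of one non-ordinal attribute."""
    if a not in counts_u or b not in counts_u:
        raise ValueError("attribute value not seen in the reference samples")
    ca, cb = counts_u[a], counts_u[b]
    na, nb = sum(ca.values()), sum(cb.values())
    return sum(abs(ca[k] / na - cb[k] / nb) ** p for k in clusters)


def mixed_distance(xi, xj, ordinal_idx, nominal_idx, counts, clusters, p=2):
    """Minkowski part over ordinal attributes plus VDM part over non-ordinal ones."""
    total = sum(abs(xi[u] - xj[u]) ** p for u in ordinal_idx)
    total += sum(vdm(counts[u], xi[u], xj[u], clusters, p) for u in nominal_idx)
    return total ** (1.0 / p)


if __name__ == "__main__":
    # Hypothetical users: (numeric attribute, domain name), with a reference labelling.
    users = [(1.0, "a.com"), (2.0, "a.com"), (5.0, "b.org"), (6.0, "b.org")]
    labels = [0, 0, 1, 1]
    counts = build_vdm_counts(users, labels, nominal_idx=[1])
    print(mixed_distance(users[0], users[2], [0], [1], counts, clusters=[0, 1]))
```

Two users who share a domain name contribute nothing through the VDM term, so their distance reduces to the Minkowski distance over the ordinal attributes.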

III. SYSTEM MODEL AND DESIGN GOALS
In this section, we describe the system model, the threat model and the security goals.

A. SYSTEM MODEL
The system model of this scheme consists of the cloud server, the key manager and the user, as shown in Fig. 1.

1) CLOUD SERVER (CS)
CS is an entity with large computing power and storage capacity that mainly provides cloud storage services. It applies the TCD-DBSCAN algorithm to count the legitimate uploaders and deduplicates the data to improve storage utilization.

2) KEY MANAGER (K-MAN)
K-man is a semi-trusted third party that is responsible for helping users generate encryption keys for unpopular data by performing the RSA-based blind signature algorithm. Users have access to K-man, and we further explain the security of the interaction between users and K-man in Section V.

3) USER (U)
Users, as the clients of the cloud storage system, need to upload and download data and perform other operations on the CS securely and conveniently. When uploading unpopular data, a user must perform a blind signature operation with K-man to obtain the encryption key that protects the data. When uploading popular data, users only need to perform simple and efficient convergent encryption.

B. THREAT MODEL
We consider the following two types of attackers:

1) INTERNAL ATTACKERS
Internal attackers are the CS and potentially malicious users. We consider the CS to be honest but curious, with arbitrary access to the user data it stores. Potentially malicious users interact with the CS following all the protocols, but they want to illegally access the data of other users.

2) EXTERNAL ATTACKERS
External attackers are unauthorized users. They obtain information about part of the uploaded data by tapping the public channel of the Internet, and their main purpose is to illegally obtain plaintext information about the data stored on the CS.

C. SECURITY GOALS
The security goals of the scheme in this paper are as follows:
1) Users are classified according to their attributes, so that internal data, which come from the same organization, will not be prematurely turned into popular data.
2) All data are protected by encryption, and attackers should not obtain any plaintext information about them.
3) The scheme is resistant to online and offline brute-force attacks.

IV. PROPOSED SCHEME
We propose a TCD-DBSCAN algorithm to solve the problem of the potential internal data leakage in deduplication. The algorithm classifies users based on their attributes. The data is classified into popular data and unpopular data according to whether the counting of data owners reaches the popularity threshold. Different encryption methods are adopted for data with different popularity, so as to balance security and efficiency.

A. TCD-DBSCAN
The conventional DBSCAN algorithm requires re-clustering whenever the data set changes, which greatly increases the computational overhead of the system. We first propose the concept of clustering deviation, i.e., tolerating a certain error and preferentially assigning newly added points to nearby classes or treating them as noise points. Based on this, we design a DBSCAN algorithm with tolerable clustering deviation (TCD-DBSCAN). First, a conventional DBSCAN clustering is performed on the existing data set; thereafter, whenever a new point q appears, the following steps are performed instead of re-clustering. The specific process is shown in Algorithm 1:

If q lies near a core point a, it is assigned to the class in which a is located. If a border point p exists near q, q is assigned to the class in which p is located and is recorded as a deviation point. Otherwise, the algorithm determines whether q meets the core point condition and, if so, assigns it to the density-connected class. If none of the above conditions is met, q is treated as a noise point. To limit the accumulated error, the data set is re-clustered when the number of deviation points exceeds a preset threshold.
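The assignment rules just described can be sketched as follows. This is one possible reading of the steps, not a transcription of Algorithm 1: the order of the checks, the handling of a new point that satisfies the core condition using only noise neighbors (it opens a new class here), and all names are assumptions.

```python
from math import dist

NOISE = -1


class TolerantAssigner:
    """Places new points into an existing DBSCAN result without re-clustering."""

    def __init__(self, points, labels, core_flags, eps, min_pts, max_deviations):
        self.points = list(points)
        self.labels = list(labels)          # cluster id per point, NOISE for noise
        self.core_flags = list(core_flags)  # True where the point is a core point
        self.eps = eps
        self.min_pts = min_pts
        self.max_deviations = max_deviations  # hypothetical re-clustering threshold
        self.deviations = 0

    def add(self, q):
        """Return (label of q, whether a full re-clustering is now due)."""
        near = [i for i, p in enumerate(self.points) if dist(p, q) <= self.eps]
        core_near = [i for i in near if self.core_flags[i]]
        border_near = [
            i for i in near if not self.core_flags[i] and self.labels[i] != NOISE
        ]

        label, is_core = NOISE, False
        if core_near:
            # q is near a core point: join that point's class.
            label = self.labels[core_near[0]]
        elif border_near:
            # q is only near a border point: join its class, record a deviation.
            label = self.labels[border_near[0]]
            self.deviations += 1
        elif len(near) + 1 >= self.min_pts:
            # q itself satisfies the core condition (assumption: open a new class;
            # the surrounding noise points are left unchanged in this sketch).
            label = max(self.labels, default=NOISE) + 1
            is_core = True

        self.points.append(q)
        self.labels.append(label)
        self.core_flags.append(is_core)
        return label, self.deviations > self.max_deviations
```

Each call costs a single pass over the stored points, whereas re-running DBSCAN would revisit every neighborhood; the `max_deviations` counter bounds how much error is allowed to accumulate before a full re-clustering is triggered.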

B. NOTATIONS
Table 1 shows some notations used in the scheme.

C. SYSTEM INITIALIZATION
In the initialization phase, a public-private key pair $\{P_U, P_R\}$ is assigned to K-man. A unique identity $ID_i$ is determined for each user $U_i$ who joins the system, and the user is assigned a unique public-private key pair $\{pk_i, sk_i\}$. The file tag list File_List is stored on the CS and the file information

D. FILE UPLOAD
File upload means that the user uploads data to the CS; it is divided into unpopular file upload and popular file upload. First, $U_i$ sends an upload request containing $ID_i$, the file tag $T_{F,i}$ and the user attributes to the CS. After the CS receives the request from $U_i$, it performs duplicate detection by running the tag lookup function $re \leftarrow TCheck(T_{F,i}, File\_List)$. Fig. 2 shows the file upload process.

1) UNPOPULAR FILE UPLOAD
If the uploaded file $F$ does not exist in the CS, $F$ is an initial file. The CS initializes the corresponding file information table to store the file ciphertext and updates the user list. When the uploaded file $F$ exists in the CS and $DB[T_F].Count < t$, $F$ is an unpopular file. The CS updates the information in the corresponding File_List and adds $U_i$ to the list of legitimate users. The detailed process is described in Fig. 3. In particular, the CS applies the TCD-DBSCAN algorithm when handling unpopular file uploads and adopts a growth curve function to calculate the weight contributed by the current uploader. The growth curve model $\phi$ is expressed in Eq. (2), where $x_1$ and $x_2$ are the independent variables, $y$ is the dependent variable, and $a$, $b$, $k$ are the parameters.

2) POPULAR FILE UPLOAD
If the uploaded file $F$ exists in the CS and $DB[T_F].Count = t$, the file $F$ undergoes a popularity conversion. The CS updates the information in the corresponding file information table and asks the user to upload the convergent-encrypted ciphertext. If the uploaded file $F$ exists in the CS and $DB[T_F].Count > t$, $F$ is a popular file. The CS adds $U_i$ to the list of legitimate users and does not require the user to upload the convergent-encrypted ciphertext. The detailed process is described in Fig. 4.
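The four upload cases distinguished in this section can be summarized as a small decision function. It only mirrors the comparisons against the threshold $t$ stated in the text; the names are hypothetical, and the way Count is accumulated (including the growth curve weighting) is deliberately left out.

```python
from enum import Enum


class UploadCase(Enum):
    INITIAL = "initial file upload"
    UNPOPULAR = "unpopular file upload"
    CONVERSION = "popularity conversion file upload"
    POPULAR = "popular file upload"


def classify_upload(file_exists: bool, count: float, t: float) -> UploadCase:
    """Map the state of DB[T_F] to one of the four upload cases."""
    if not file_exists:
        return UploadCase.INITIAL      # CS creates the file information table
    if count < t:
        return UploadCase.UNPOPULAR    # key obtained through the blind signature
    if count == t:
        return UploadCase.CONVERSION   # user uploads the convergent ciphertext
    return UploadCase.POPULAR          # user is only added to the user list


def needs_convergent_ciphertext(case: UploadCase) -> bool:
    """Only the conversion upload has to send the convergent-encrypted ciphertext."""
    return case is UploadCase.CONVERSION
```

For example, with a hypothetical threshold `t = 5`, `classify_upload(True, 3, 5)` yields the unpopular case and `classify_upload(True, 5, 5)` the conversion case.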

V. SECURITY ANALYSIS
In this section, we formally prove that our scheme is correct and secure.

A. TAG SECURITY
In this scheme, the user calculates the file tag, and the CS detects duplicate files based on the bilinear mapping. We analyze and prove the correctness and uniqueness of the file tag.
Proof: According to the tag generation function, $L_i = g^{H(C_1) \cdot sk_i}$ and $\&_i = g^{sk_i}$, where $C_1$ is the convergent ciphertext corresponding to $F$. By the bilinearity of the mapping, the following equation is obtained:
$$e(L_i, \&_j) = e(g^{H(C_1) \cdot sk_i}, g^{sk_j}) = e(g, g)^{H(C_1) \cdot sk_i \cdot sk_j} = e(g, g)^{H(C_1) \cdot sk_j \cdot sk_i} = e(g^{H(C_1) \cdot sk_j}, g^{sk_i}) = e(L_j, \&_i).$$
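The equality check on tags can be tried out with a toy bilinear map. The sketch assumes tags of the form $\langle g^{H(C_1) \cdot sk}, g^{sk} \rangle$, as in the last terms of the derivation. It is not a real pairing library: elements of $G$ are stored directly as their exponents and the target group is a small subgroup of $\mathbb{Z}_p^*$, so the construction is bilinear but completely insecure. All names and parameters are hypothetical.

```python
import hashlib

Q = 1019        # prime order of the toy groups
P = 2 * Q + 1   # 2039 is prime, so Z_P^* contains a subgroup of order Q
GT_GEN = 4      # generator of that subgroup; plays the role of e(g, g)


def pairing(a: int, b: int) -> int:
    """Toy e(g^a, g^b) = e(g, g)^(a*b), with G elements given as exponents."""
    return pow(GT_GEN, (a * b) % Q, P)


def digest(ciphertext: bytes) -> int:
    """H(C_1) reduced into Z_Q (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(ciphertext).digest(), "big") % Q


def make_tag(h: int, sk: int):
    """Tag <L, &> = <g^(h*sk), g^sk>, stored as exponents."""
    return (h * sk) % Q, sk % Q


def same_file(tag_i, tag_j) -> bool:
    """Duplicate check: e(L_i, &_j) == e(L_j, &_i)."""
    l_i, amp_i = tag_i
    l_j, amp_j = tag_j
    return pairing(l_i, amp_j) == pairing(l_j, amp_i)
```

Two users with different secret keys but the same $H(C_1)$ pass the check because both sides reduce to $e(g, g)^{H(C_1) \cdot sk_i \cdot sk_j}$, while tags built from different digests produce different exponents and fail it.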

2) TAG UNIQUENESS
Theorem 2: Let the initial uploader of file $F$, $U_i$, calculate the file tag $T_{F,i} = \langle L_i, \&_i \rangle$ and upload it to the CS, where it is saved as the unique identifier of $F$. When the user $U_j$ uploads the file $F'$, the file tag $T_{F',j} = \langle L_j, \&_j \rangle$ is calculated and uploaded to the CS for duplicate detection. The scheme guarantees the uniqueness of the data tag, i.e., there exists $F \neq F'$ such that the probability that e(
Proof: The proof is by contradiction. Suppose there exists

B. KEY SECURITY
The key for an unpopular file is generated through an interaction between the user and K-man using the RSA-based blind signature algorithm. The scheme takes full advantage of the properties of blind signatures to ensure its security: no party except K-man can generate a valid blind signature in its name. The user obtains $z$ after unblinding the blind signature $y$ and determines whether it is a valid signature of K-man by verifying $V(h, z) = \text{TRUE} \Leftrightarrow h \equiv z^e \bmod n$.
; Unpopular data ciphertext sets, If there exists $C_t = C$ or $C_n = C$ in the set of exhaustive ciphertexts, the adversary wins the game and obtains the plaintext data. The time complexity for a malicious user (MU) to successfully crack the data is $O(2^{mSize})$ or $O(2^{mSize + rSize})$, which is computationally infeasible, so the scheme is resistant to attacks from malicious users.

2) OFFLINE BRUTE-FORCE ATTACK
CS attempts an offline brute-force attack on the user-uploaded file tag $T_F = e(g^{H(C_1) \cdot sk}, g^{sk})$, where CS knows $g$, the public key $pk$ and the prime $p$. The attack is as follows: CS tries to compute the candidate tags $e(g^{H(C_k) \cdot sk_l}, g^{sk_l})$ for candidate ciphertexts $C_1, \ldots, C_n$ and keys $sk_1, \ldots, sk_s$, and compares them with $T_F$.
Lemma 1: Discrete logarithm problem (DLP): Let $G$ be the multiplicative group of large prime order $P$, where $g$ is a generator. For a given $Q = g^x \in G$, compute $x \in \mathbb{Z}_N$.
Lemma 2: Computational Diffie-Hellman problem (CDH problem): Let $G$ be the multiplicative group of large prime order $P$, where $g$ is a generator. The CDH problem can be described as follows: for given $g, g^a, g^b \in G$, compute $Q = g^{ab} \in G$, where $a$, $b$ are unknown integers.
Theorem 3: The scheme is resistant to offline brute-force attack.
Proof: Without loss of generality, consider user $U_i$. By Lemma 1, even though the public key $pk_i$ of $U_i$ is known, recovering the secret key $sk_i$ is difficult. By Lemma 2, given $e(g^{sk_i}, g^{sk_i})$ and $e(g^{H(C_1)}, g^{sk_i})$, computing $e(g^{H(C_1) \cdot sk_i}, g^{sk_i})$ is infeasible. Therefore, CS cannot compute the tag set needed for the comparison.

3) ONLINE BRUTE-FORCE ATTACK
To prevent online brute-force attacks by malicious users, our scheme also adopts a rate-limiting policy to limit the frequency of user-K-man interactions [15]. When Eq.(4) is satisfied, malicious users can be effectively prevented from performing online brute-force attacks on U i .
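A rate-limiting policy of this kind can be sketched as a per-user sliding window over key requests. The window length, the request limit and the class name are assumptions chosen for illustration; they are not taken from Eq. (4) or from the cited scheme.

```python
from collections import defaultdict, deque


class KeyRequestLimiter:
    """Allows at most max_requests blind-signature requests per user per window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # user id -> timestamps of granted requests

    def allow(self, user_id: str, now: float) -> bool:
        """Return True and record the request if the user is under the limit."""
        history = self._history[user_id]
        # Drop requests that have left the sliding window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False
        history.append(now)
        return True
```

K-man would call `allow(user_id, now)` before answering a signing request and refuse to sign when it returns False, which caps the number of plaintext guesses a malicious user can test per window.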

VI. SIMULATION AND EXPERIMENTAL ANALYSIS
The experiments adopt the OpenSSL [20], PBC [21], and GMP [22] cryptographic function libraries and implement the client and server software in the C++ programming language. We adopt the MD5 hash function to generate the data tags, SHA-256 to generate the convergent encryption key, and 256-bit AES to encrypt and decrypt the data. To simulate the Internet application environment, more than 500 files are stored on the CS and have been classified according to their attributes. We divide this scheme into four cases, i.e., initial file upload, unpopular file upload, popularity conversion file upload, and popular file upload. Each operation is repeated 10 times and the average value is taken as the final result.
VOLUME 11, 2023
The Employee Attrition Analytics dataset [23] is adopted for the experiment, and the dataset is normalized and the features are downscaled to present better experimental results.
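The convergent encryption step mentioned in the setup, where the key is the SHA-256 hash of the plaintext, can be illustrated with the standard library alone. To keep the sketch self-contained, a SHA-256 counter-mode keystream stands in for the 256-bit AES used in the experiments; that substitution and the function names are assumptions, and the toy cipher is not meant for real use.

```python
import hashlib


def convergent_key(plaintext: bytes) -> bytes:
    """K = SHA-256(M): the same plaintext always yields the same 256-bit key."""
    return hashlib.sha256(plaintext).digest()


def _keystream(key: bytes, length: int) -> bytes:
    """Deterministic keystream (stand-in for AES-256; an assumption of this sketch)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(plaintext: bytes):
    """Return (key, ciphertext); both are fully determined by the plaintext."""
    key = convergent_key(plaintext)
    stream = _keystream(key, len(plaintext))
    return key, bytes(a ^ b for a, b in zip(plaintext, stream))


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    stream = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))
```

Because encryption is deterministic, two users uploading the same popular file produce byte-identical ciphertexts, which the CS can deduplicate without learning the plaintext. The same determinism is what exposes predictable plaintexts to offline guessing.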

A. PERFORMANCE ANALYSIS
We analyze the communication overhead and compare it with Stanek et al. [11] and Gao et al. [24], as shown in Table 2. For file $F$, $C_T$ and $C_C$ denote the data tag size and the ciphertext size, respectively; $C_K$, $C_{rk}$ and $C_{sk}$ denote the data encryption key size, the random key size and the user private key size, respectively; and $C_{ID}$ denotes the user identification size. Table 2 shows that our scheme outperforms the other schemes in terms of communication overhead.
In Table 3, we compare the storage overhead of unpopular data with Stanek et al. [11] and Gao et al. [24]. N is the number of legitimate users of unpopular data. From Table 3, we can see that our scheme outperforms other schemes in terms of storage overhead for unpopular data.

B. TCD-DBSCAN EFFICIENCY
To select the parameters of the algorithm in this experiment, we follow the method proposed in [25]. The value of Eps can be obtained by drawing the k-distance graph, and there is a rule of thumb for the selection of MinPts: $MinPts \geq dim + 1$, where $dim$ denotes the dimensionality of the data to be clustered. After repeated adjustment of the parameters, the value of Eps for this dataset is set to 1.85 and the value of MinPts to 12.
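The k-distance graph used to pick Eps can be computed as in the sketch below: for every point, take the distance to its k-th nearest neighbor, then sort these values in descending order; Eps is usually read off near the "knee" of the resulting curve. Euclidean distance and the function names are assumptions, and the knee still has to be chosen by inspection.

```python
from math import dist


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    values = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(others[k - 1])
    return sorted(values, reverse=True)


def min_pts_lower_bound(dim: int) -> int:
    """Rule of thumb from the text: MinPts >= dim + 1."""
    return dim + 1


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
    print(k_distances(pts, k=2))
```

In the small example the isolated point (10, 0) produces a k-distance far above the others, so the curve drops sharply after the first value; an Eps chosen just below that drop would leave the isolated point as noise.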
To simulate a real Internet application scenario, we store more than 500 different files in the CS, each associated with multiple users. Based on the attributes of the uploaders, we cluster the users of each file. Then, new users with different attributes are simulated to perform an upload operation on a specific file. The running time of the algorithm is recorded and compared with that of the conventional DBSCAN algorithm; the test results are shown in Fig. 5.
It can be seen that the TCD-DBSCAN algorithm is considerably more computationally efficient than the conventional DBSCAN algorithm, and that its advantage becomes more pronounced as the number of subsequent uploaders increases.

C. COMMUNICATION OVERHEAD
According to the characteristics of the scheme, we test the computational time overhead for the four cases; each case includes tag generation, key generation, data encryption, and file uploading. We upload files of different sizes to the CS and perform simulation experiments. Fig. 6 shows the time overhead for different file sizes in each case, and the total time overhead for the four cases is shown in Fig. 7. The comparison with similar schemes such as Stanek et al. [11], PerfectDedup [26], Zhang et al. [14], and Gao et al. [24], in the same network environment and with the same file size (20 MB), is shown in Fig. 8. Overall, the proposed scheme outperforms the other schemes in terms of time overhead.
The candidate tag set over ciphertexts $C_1, \ldots, C_n$ and keys $sk_1, \ldots, sk_s$, of the form used in the offline brute-force analysis of Section V, is
$$\begin{pmatrix}
e(g^{H(C_1) \cdot sk_1}, g^{sk_1}) & e(g^{H(C_1) \cdot sk_2}, g^{sk_2}) & e(g^{H(C_1) \cdot sk_3}, g^{sk_3}) & \cdots & e(g^{H(C_1) \cdot sk_s}, g^{sk_s}) \\
e(g^{H(C_2) \cdot sk_1}, g^{sk_1}) & e(g^{H(C_2) \cdot sk_2}, g^{sk_2}) & e(g^{H(C_2) \cdot sk_3}, g^{sk_3}) & \cdots & e(g^{H(C_2) \cdot sk_s}, g^{sk_s}) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
e(g^{H(C_n) \cdot sk_1}, g^{sk_1}) & e(g^{H(C_n) \cdot sk_2}, g^{sk_2}) & e(g^{H(C_n) \cdot sk_3}, g^{sk_3}) & \cdots & e(g^{H(C_n) \cdot sk_s}, g^{sk_s})
\end{pmatrix}$$

D. CHARACTERISTICS
The proposed scheme is compared with Stanek [11], PerfectDedup [26], Zhang [14], Gao [24], Yuan [27] and Wang [28] in terms of scheme characteristics, and the results are shown in Table 4. Table 4 shows that this scheme considers the impact of user attributes on data popularity classification and deduplicates the unpopular data. It prevents internal data leakage, reduces storage overhead and improves deduplication efficiency.

VII. CONCLUSION
In this paper, we study the issue of encrypted data deduplication and propose the TCD-DBSCAN algorithm. The concept of clustering deviation is proposed, and our algorithm is applied in the deduplication process to reduce the risk of internal data leakage. Premature conversion of unpopular data is avoided even if the data are uploaded by users from the same organization. For unpopular data, symmetric encryption is adopted and the encryption key is obtained through a blind signature protocol between the key manager and the users. As a result, deduplication of unpopular data can be achieved without users passing their keys online, which further improves the efficiency of deduplication. Security analysis and performance evaluation demonstrate that the proposed scheme is secure and of practical value.
YAN  FENG GUO is the Product Manager of Shandong Zhengzhong Information Technology Company Ltd., China. His research interests include asymmetric cryptographic algorithm and privacy-preserving computing.