Introduction
Cloud computing has become a rapidly developing and popular model of distributed computing and storage, offering high-quality data storage, quick and convenient computing, and on-demand services. Outsourcing services built on the cloud can effectively reduce the cost that enterprises spend on purchasing hardware and software and on managing data. Attracted by its convenience, economy and high scalability, more and more individuals and enterprises are motivated to outsource their data or computation to the cloud. However, in the outsourcing cloud, the Data Owner (DO) cannot directly control and manage the data stored on the Cloud Server (CS). Thus, DO cannot be certain whether the data are protected and whether they are used and computed on legally and reasonably, so the privacy of the data is seriously threatened. At present, the privacy issues in the outsourcing cloud have become a major obstacle impeding its further development [2].
In the outsourced cloud, a naive approach to protecting data confidentiality is to encrypt the data before outsourcing them to the cloud. However, encrypted data cannot be directly searched and used. When the scale of the data is small, DO can download all the data to a local computer, decrypt them, and obtain the needed information from the plaintext. But in today's increasingly popular big data applications, this method incurs a huge cost in time and bandwidth to acquire the needed information, so it is not practical. Therefore, it is a challenge to perform privacy-preserving ranked search over encrypted cloud data.
In this paper, we propose a privacy-preserving multi-keyword ranked search scheme over encrypted data in hybrid clouds. The keyword partition vector model is presented, in which the keywords of documents are clustered into balanced partitions by a bisecting $k$-means clustering based keyword partition algorithm.
The contributions of this paper are as follows.
We propose a keyword partition vector model. In this model, a bisecting $k$-means clustering based keyword partition algorithm is proposed, which generates the balanced keyword partitions and the keyword partition based bit vectors (DFB-vector and QFB-vector). The DFB-vectors serve as the index for searches.
On the basis of the keyword partition vector model and the complete binary tree structure, we propose an efficient ranked search scheme over encrypted data in hybrid clouds. The private cloud filters out the candidate documents, and then the public cloud determines the result.
We analyze the security of the proposed scheme and evaluate its search performance. The results show that the proposed scheme is a privacy-preserving multi-keyword ranked search scheme for hybrid clouds and outperforms the existing scheme FMRS in terms of search efficiency.
Related Work
To support the multi-keyword search over the outsourced encrypted cloud data, researchers have proposed many Searchable Encryption (SE) schemes [4]–[18].
Song et al. [4] proposed the first symmetric searchable encryption (SSE) scheme. Cao et al. [5], [6] proposed the first multi-keyword ranked search scheme. The vector space model (VSM) [19] and secure KNN [20] are adopted to achieve the privacy-preserving ranked searches. Xu et al. [7] proposed a two-step-ranking search scheme over encrypted cloud data which adopts the order-preserving encryption (OPE) [21], [22]. Yang et al. [8] proposed a fast privacy-preserving multi-keyword search scheme. It supports dynamic updates on documents. Li et al. [9], [10] proposed a fine-grained multi-keyword search scheme over encrypted cloud data. However, only boolean queries are supported. Xia et al. [11] proposed a secure and dynamic multi-keyword ranked search scheme by adopting a balanced binary tree index. Chen et al. [13] and Zhu et al. [12] proposed two different privacy-preserving ranked search schemes, which both utilize clustering algorithm to improve search efficiency.
Fu et al. [15] and Wang et al. [14] proposed multi-keyword fuzzy search schemes over encrypted outsourced data. To achieve fuzzy search, the locality-sensitive hashing functions [23], wordnet and secure KNN are adopted. Wang et al. [16] presented a multi-keyword fuzzy search scheme which supports range queries by adopting the locality-sensitive hashing functions, bloom filtering [24] and order-preserving encryption. Fu et al. [17] proposed a synonym expansion of document keywords and realized the synonym-based multi-keyword ranked search scheme. Xia et al. [18] proposed a multi-keyword semantic ranked search scheme where the inverted index for documents and the semantic relationship library for keywords are adopted. Fu et al. [25] proposed a different semantic-aware ranked search scheme which adopts the concept hierarchy and the semantic relationship between concepts.
According to the state of the art, most existing works focus on public clouds. Only Yang et al. [26] proposed a search scheme for hybrid clouds, which consist of the public cloud (Pub-Cloud) and the private cloud (Pri-Cloud). In that scheme, Pri-Cloud is assumed to be trusted while Pub-Cloud is assumed to be honest-but-curious. The keywords of documents are equally divided into multiple partitions, and a document index vector is created for each document according to the partitions. Pri-Cloud uses the document index vectors to obtain the candidate document identities, and then Pub-Cloud determines the result among the encrypted documents whose identities are candidates. The more partitions the searched keywords cover, the more candidate document identities Pri-Cloud obtains, so the search cost is proportional to the number of partitions covering the queried keywords. In practice, the searched keywords are usually relevant to each other. For example, "basketball", "NBA" and "slam dunk" could be the queried keywords for retrieving news of interest, and they are obviously relevant. Therefore, if keywords with high relevance are gathered into fewer partitions, the search efficiency improves when the searched keywords are relevant.
Notations and Preliminaries
A. Notations
$d_{i}$ — A plaintext document.
$D$ — A plaintext document collection, $D=\{d_{1},d_{2},\ldots,d_{m}\}$.
$V_{d_{i}}$ — The $n$-dimensional document vector of $d_{i}$.
$V_{D}$ — The set of document vectors of the documents in $D$, $V_{D}=\{V_{d_{1}},V_{d_{2}},\ldots,V_{d_{m}}\}$.
$\widetilde {d}_{i}$ — The encrypted document of $d_{i}$.
$\widetilde {D}$ — The encrypted document collection of $D$, $\widetilde {D}=\{\widetilde {d}_{1},\widetilde {d}_{2},\ldots,\widetilde {d}_{m}\}$.
$\widetilde {V}_{d_{i}}$ — The encrypted $n$-dimensional document vector of $d_{i}$.
$\widetilde {V}_{D}$ — The set of encrypted document vectors, $\widetilde {V}_{D}=\{\widetilde {V}_{d_{1}},\widetilde {V}_{d_{2}},\ldots,\widetilde {V}_{d_{m}}\}$.
$W$ — A keyword dictionary with $n$ keywords, $W=\{w_{1},w_{2},\ldots,w_{n}\}$.
$P\!L$ — A list of keyword partitions, $P\!L=\{P_{1},P_{2},\ldots,P_{\tau}\}$.
$V\!F_{d_{i}}$ — The $\tau$-dimensional DFB-vector of $d_{i}$.
$V\!F_{D}$ — The set of DFB-vectors of the documents, $V\!F_{D}=\{V\!F_{d_{1}},V\!F_{d_{2}},\ldots,V\!F_{d_{m}}\}$.
$Q$ — A query request with multiple keywords.
$V_{Q}$ — The $n$-dimensional query vector of $Q$.
$\widetilde {V}_{Q}$ — The trapdoor of $Q$, which is the encrypted $n$-dimensional query vector.
$V\!F_{Q}$ — The $\tau$-dimensional QFB-vector of $Q$.
$C\!I\!D$ — A set of candidate document IDs for the query $Q$.
B. Preliminaries
1) Vector Space Model
The vector space model [19] with TF-IDF weighting [27] is widely adopted in secure multi-keyword search [5], [6], [11]–[13], and we also use it in this paper. TF (term frequency) is the number of times a given keyword or term occurs in a document, while IDF (inverse document frequency) is calculated by dividing the total number of documents by the number of documents containing the given keyword or term. Each document $d_{i}$ is represented by an $n$-dimensional vector $V_{d_{i}}$ and each query $Q$ by an $n$-dimensional vector $V_{Q}$, whose normalized entries are computed as \begin{align*} V_{d_{i}}[j]&=T\!F_{d_{i},w_{j}}/\sqrt {\sum _{w_{j} \in d_{i} \wedge d_{i} \in D}(T\!F_{d_{i},w_{j}})^{2}}\tag{1}\\ V_{Q}[j]&=I\!D\!F_{w_{j}}/\sqrt {\sum _{w_{j}\in Q}(I\!D\!F_{w_{j}})^{2}}\tag{2}\end{align*}
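To make Eqs. (1) and (2) concrete, the following Python sketch builds the normalized document and query vectors; the dictionary, documents and query used here are hypothetical, and the IDF is taken as the plain ratio described above (a logarithmic variant is also common).

```python
import math

# Hypothetical keyword dictionary W and document collection D (documents as keyword lists).
W = ["cloud", "search", "privacy", "encryption"]
D = {
    "d1": ["cloud", "search", "cloud", "privacy"],
    "d2": ["encryption", "privacy", "privacy"],
}

def document_vector(doc_terms, W):
    """Eq. (1): normalized TF vector V_d of one document."""
    tf = [doc_terms.count(w) for w in W]
    norm = math.sqrt(sum(t * t for t in tf))
    return [t / norm if norm else 0.0 for t in tf]

def query_vector(query_terms, W, D):
    """Eq. (2): normalized IDF vector V_Q of a query."""
    m = len(D)
    def idf(w):
        df = sum(1 for terms in D.values() if w in terms)
        return m / df if df else 0.0
    raw = [idf(w) if w in query_terms else 0.0 for w in W]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

V_d1 = document_vector(D["d1"], W)
V_Q = query_vector({"privacy", "encryption"}, W, D)
```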
2) Relevance Score Measurement
We adopt the same calculation as [13] to measure the relevance scores between documents and search requests in this paper. The relevance score between a document $d_{i}$ and a query $Q$ is measured by the inner product of their vectors: \begin{equation*}score(V_{d_{i}},V_{Q})=V_{d_{i}}\cdot V_{Q} =\sum _{j=1}^{n}V_{d_{i}}[j]\times V_{Q}[j]\tag{3}\end{equation*}
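A one-line realization of Eq. (3), reusing the vectors from the sketch above, followed by the ranking step built on it:

```python
def score(V_d, V_Q):
    """Eq. (3): relevance score as the inner product of a document vector and the query vector."""
    return sum(a * b for a, b in zip(V_d, V_Q))

# Rank all documents by relevance score in descending order.
ranking = sorted(D, key=lambda doc: score(document_vector(D[doc], W), V_Q), reverse=True)
```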
3) Secure Inner Product Operation
The secure inner product operation [20] is adopted in this paper. The operation is capable of computing the inner product of two encrypted vectors even if their plaintext values are unknown. We assume that $M$ is a random invertible matrix serving as the secret key, and that two vectors $p$ and $q$ are encrypted as $\widetilde {p}=pM^{-1}$ and $\widetilde {q}=qM^{T}$. Then \begin{align*} \widetilde {p}\cdot \widetilde {q}&=(pM^{-1})\cdot (qM^{T}) \\&=pM^{-1}(qM^{T})^{T} \\&=pM^{-1}Mq^{T} \\&=p\cdot q\tag{4}\end{align*}
Therefore, the cloud can compute the inner product of two vectors from their encrypted forms without learning the plaintext values, which is the basis for calculating relevance scores over encrypted vectors.
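A minimal numpy sketch of the secure inner product operation in Eq. (4); the random invertible matrix M stands in for the secret key, and the vectors are arbitrary examples.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)

M = rng.standard_normal((n, n))          # random key matrix, invertible with probability 1
p = rng.random(n)                        # e.g. a plaintext document vector
q = rng.random(n)                        # e.g. a plaintext query vector

p_enc = p @ np.linalg.inv(M)             # p~ = p * M^{-1}
q_enc = q @ M.T                          # q~ = q * M^{T}

# The inner product of the encrypted vectors equals the plaintext inner product (Eq. (4)).
assert np.isclose(p_enc @ q_enc, p @ q)
```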
4) Normalized Google-Distance
Given two keywords $w_{i}$ and $w_{j}$, the normalized Google distance between them is defined as \begin{equation*} Dist(w_{i},w_{j})= \frac {max\{log_{2}F_{i},log_{2}F_{j}\}-log_{2}F_{i,j}} {log_{2}m-min\{log_{2}F_{i},log_{2}F_{j} \}}\tag{5}\end{equation*}
In Eq. (5), $F_{i}$ and $F_{j}$ denote the numbers of documents containing $w_{i}$ and $w_{j}$ respectively, $F_{i,j}$ denotes the number of documents containing both keywords, and $m$ is the total number of documents. A smaller distance indicates a higher relevance between the two keywords.
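A small sketch of Eq. (5); F_i, F_j and F_ij are assumed to be document frequencies counted from the collection, and the example values are illustrative.

```python
import math

def ngd(F_i, F_j, F_ij, m):
    """Eq. (5): normalized Google distance between two keywords.
    F_i, F_j: numbers of documents containing w_i / w_j; F_ij: number containing both; m: total documents."""
    num = max(math.log2(F_i), math.log2(F_j)) - math.log2(F_ij)
    den = math.log2(m) - min(math.log2(F_i), math.log2(F_j))
    return num / den

print(ngd(F_i=120, F_j=90, F_ij=80, m=10000))   # frequently co-occurring keywords: small distance
print(ngd(F_i=120, F_j=90, F_ij=2, m=10000))    # rarely co-occurring keywords: larger distance
```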
Models and Problem Description
A. System Model
The system model adopted in this paper is the same as that in [26], which has four entities: the data owner (DO), the data user (DU), the private cloud (Pri-Cloud) and the public cloud (Pub-Cloud). Their cooperation is shown in FIGURE 1.
DO owns the sensitive data. To protect the privacy of its data, DO encrypts the documents and the corresponding vectors and then outsources the encrypted data to Pub-Cloud. DO also constructs the DFB-vectors, which are stored in Pri-Cloud, as the index to improve search efficiency. Besides, DO has the privilege to grant DU the authorization to access the outsourced data.
DU is the user authorized by DO, who is authorized to search the data outsourced in Pub-Cloud. Once DU starts a ranked multi-keyword search, the queried keywords are transformed into a corresponding trapdoor and a QFB-vector which are submitted to Pub-Cloud and Pri-Cloud respectively for processing ranked searches. After DU receives the search result from Pub-Cloud, it decrypts the encrypted data to get the plaintext result.
Pri-Cloud is in charge of storing the index, which consists of the DFB-vectors of the documents. Once it receives the QFB-vector of a query from DU, it performs bitwise AND operations between the QFB-vector and the DFB-vectors, filters out the candidate document IDs for the query, and then transmits these IDs to Pub-Cloud for further search processing.
Pub-Cloud is in charge of storing the outsourced data from DO. Once receiving the trapdoor and the candidate document IDs from DU and Pri-Cloud respectively, it performs the ranked search on the encrypted documents whose IDs are in the received candidate document IDs and then returns the result encrypted documents to DU.
B. Search Model
Given a document collection $D$ and a multi-keyword query $Q$, the ranked search returns the top-$k$ relevant documents, denoted as $R$ ($R \subseteq D$), which satisfy
$|R| = k \wedge \forall d_{i},d_{j}(d_{i} \in R \wedge d_{j} \in (D-R)) \rightarrow score(V_{d_{i}},V_{Q}) \geq score(V_{d_{j}},V_{Q})$.
The documents in $R$ are ranked according to the relevance scores between them and $Q$.
C. Problem Description
In our scheme, we consider the same threat model as [26], which assumes that DO, DU and Pri-Cloud are trusted, while Pub-Cloud is considered "honest-but-curious". This means that Pub-Cloud always runs the pre-deployed algorithms honestly and returns results correctly, but it is curious about the plaintext of the outsourced data and may try to learn it through data analysis and deduction, which could cause privacy leakage. We assume that Pub-Cloud holds the encrypted data outsourced by DO but does not have the secret keys. According to the knowledge available to Pub-Cloud, two threat models are adopted as follows, which are also used in many related works [5], [6], [11]–[13], [15], [17], [25].
Known Ciphertext Model. In this model, Pub-Cloud only knows the encrypted documents $\widetilde {D}$ and the trapdoor $\widetilde {V}_{Q}$, but it does not have any plaintext information about them. This means that Pub-Cloud can only mount a ciphertext-only attack (COA) [28] to learn the plaintext data.
Known Background Model. In this model, Pub-Cloud is assumed to have more knowledge than in the known ciphertext model, such as the keyword frequency statistics of the document collection. The statistical information reveals the number of documents in $D$ that contain specific keywords, which could be used by Pub-Cloud to apply TF statistical attacks and hence infer or even identify certain keywords by analyzing the histogram or value range of the corresponding frequency distributions [11], [12], [29].
In this paper, we focus on the multi-keyword ranked search scheme over encrypted data in hybrid clouds. The design goals are as follows.
1) Multi-Keywords Ranked Search
The proposed scheme is designed so that Pub-Cloud can determine the top-$k$ relevant encrypted documents for a multi-keyword query, ranked by the relevance scores, without learning any plaintext information about the documents or the queried keywords.
2) Search Efficiency
The proposed scheme is able to perform efficient multi-keyword ranked searches by using a special index constructed on the basis of the given keyword partition vector model and the complete binary tree structure. The index can filter out candidate documents and prune a large number of irrelevant documents.
Keyword Partition Vector Model
To describe the keyword partition vector model (KPVM), the clustering based keyword partition algorithm is first introduced in this section. Then the keyword partition based bit vectors are defined formally, which are the index of the proposed scheme.
A. Clustering Based Keyword Partition
We design the algorithm $GenPartitions(W,\tau)$ to generate balanced keyword partitions from the keyword dictionary, as shown in Algorithm 1.
Algorithm 1 $GenPartitions(W,\tau)$
Input: The keyword dictionary $W$ and the partition parameter $\tau$.
Output: The keyword partition list $P\!L$.
Initialize
Add
while
Apply the bisecting
end while
return
In Algorithm 1, max denotes the maximum number of keywords allowed in a single partition, which keeps the generated partitions balanced.
In addition, the bisecting $k$-means clustering algorithm is selected for two reasons: it tends to produce balanced clusters, and it is more efficient than ordinary $k$-means and hierarchical clustering. The sketch below illustrates the partitioning idea.
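The following Python sketch illustrates the partitioning idea of Algorithm 1 under stated assumptions: keyword relevance is given by a precomputed pairwise distance matrix (for instance from Eq. (5)), the largest partition is repeatedly bisected by seeding two sub-partitions with the farthest keyword pair, and splitting stops once every partition respects the size bound max. The exact seeding, iteration and stopping rule of Algorithm 1 may differ.

```python
def gen_partitions(keywords, dist, max_size):
    """Bisecting-style keyword partitioning (a simplified sketch of Algorithm 1).
    keywords: list of keyword ids; dist[i][j]: distance between keywords i and j;
    max_size: maximum number of keywords allowed in a partition."""
    partitions = [list(keywords)]                        # PL starts as one partition holding all of W
    while any(len(p) > max_size for p in partitions):    # some partition is still too large
        p = max(partitions, key=len)                     # bisect the largest partition
        partitions.remove(p)
        # Seed the two sub-partitions with the farthest pair of keywords in p.
        a, b = max(((x, y) for x in p for y in p if x != y),
                   key=lambda xy: dist[xy[0]][xy[1]])
        left, right = [a], [b]
        for w in p:
            if w not in (a, b):
                (left if dist[w][a] <= dist[w][b] else right).append(w)
        partitions.extend([left, right])
    return partitions
```

Relevant keywords (small pairwise distances) end up in the same partition, which is what later keeps the number of target partitions of a realistic query small.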
Observation 1:
According to Algorithm 1, we have the following two properties about the generated keyword partitions.
1) $\bigcup _{P_{i} \in PL} P_{i} =W$;
2) $\forall P_{i},P_{j} \in PL (P_{i} \neq P_{j}) \rightarrow P_{i} \cap P_{j} = \emptyset$.
Observation 1 can be deduced from the procedures of Algorithm 1. This observation indicates that the generated partitions are the divisions of the keyword dictionary and there are no intersections between any two partitions.
Definition 1 (Involved Partitions):
Given a document $d_{i}$, the involved partitions of $d_{i}$, denoted as $I\!P\!S(d_{i})$, are the keyword partitions that contain at least one keyword of $d_{i}$, i.e., \begin{equation*} I\!P\!S(d_{i})=\{P_{j}|P_{j}\cap d_{i} \neq \emptyset \wedge P_{j} \in PL \}\tag{6}\end{equation*}
Definition 2 (Covered Documents):
Given a keyword partition $P_{i}$, the covered documents of $P_{i}$, denoted as $C\!D\!S(P_{i})$, are the documents that contain at least one keyword in $P_{i}$, i.e., \begin{equation*} C\!D\!S(P_{i})=\{d_{j}|d_{j}\cap P_{i} \neq \emptyset \wedge d_{j} \in D \}\tag{7}\end{equation*}
According to Definition 1 and 2, we can deduce Observation 2 as follows.
Observation 2:
Given a document $d_{i}$ and a keyword partition $P_{j}$, \begin{equation*} d_{i} \in C\!D\!S(P_{j}) \leftrightarrow P_{j} \in I\!P\!S(d_{i})\tag{8}\end{equation*}
Definition 3 (Target Partitions):
Given a query $Q$, the target partitions of $Q$, denoted as $T\!P\!S(Q)$, are the keyword partitions that contain at least one queried keyword, i.e., \begin{equation*} T\!P\!S(Q)=\{P_{i}| Q \cap P_{i} \neq \emptyset \wedge P_{i} \in PL \}\tag{9}\end{equation*}
Definition 4 (Candidate Documents):
Given a query $Q$, the candidate documents of $Q$, denoted as $C\!Docs(Q)$, are the union of the covered documents of its target partitions, i.e., \begin{equation*} C\!Docs(Q)=\bigcup _{P_{i}\in T\!P\!S(Q)} C\!D\!S(P_{i})\tag{10}\end{equation*}
We give an example to illustrate the above algorithm and definitions.
Example 1:
FIGURE 2 illustrates an example of keyword partitions generated by Algorithm 1.
B. Keyword Partition Based Bit Vectors
Definition 5 (Document Filtering Bit Vector (DFB-Vector)):
Given a document $d_{i}$ and the keyword partition list $P\!L$, the DFB-vector of $d_{i}$ is a $\tau$-dimensional bit vector $V\!F_{d_{i}}$ defined as \begin{equation*} V\!F_{d_{i}}[j]= \begin{cases} 1,& \exists w_{P} \in d_{i}(w_{P} \in P_{j})\\ 0,& Else \end{cases}\quad j \in \{1,2,\ldots,\tau \}\tag{11}\end{equation*}
Definition 6 (Query Filtering Bit Vector (QFB-Vector)):
Given a query $Q$ and the keyword partition list $P\!L$, the QFB-vector of $Q$ is a $\tau$-dimensional bit vector $V\!F_{Q}$ defined as \begin{equation*} V\!F_{Q}[i]= \begin{cases} 1,& \exists w_{P} \in Q(w_{P} \in P_{i})\\ 0,& Else \end{cases}\quad i \in \{1,2,\ldots,\tau \}\tag{12}\end{equation*}
According to Definitions 5 and 6, the DFB-vector indicates the involved partitions of the corresponding document while the QFB-vector indicates the target partitions of a query. For example, if $V\!F_{d_{i}}[j]=1$, then $d_{i}$ contains at least one keyword of partition $P_{j}$, i.e., $P_{j} \in I\!P\!S(d_{i})$; similarly, if $V\!F_{Q}[i]=1$, then $P_{i} \in T\!P\!S(Q)$.
Observation 3:
Given a query $Q$, the candidate documents of $Q$ can be determined by bitwise AND operations between the QFB-vector of $Q$ and the DFB-vectors of the documents, i.e., \begin{equation*} C\!Docs(Q)=\{d_{i} | V\!F_{d_{i}}\& V\!F_{Q} \neq \{0\}^\tau \wedge d_{i} \in D\}\tag{13}\end{equation*}
Observation 3 indicates that, given a document $d_{i}$ and a query $Q$, $d_{i}$ is a candidate document of $Q$ if and only if the bitwise AND of $V\!F_{d_{i}}$ and $V\!F_{Q}$ is not the all-zero vector. Therefore, the candidate documents can be filtered out by simple and efficient bitwise operations.
We also take Example 1 as an illustration, where ten documents and the query are considered; a sketch of the vector generation and filtering follows.
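A minimal Python sketch of Definitions 5 and 6 and Observation 3, with partitions and documents represented as keyword sets; all identifiers and keywords below are illustrative.

```python
def dfb_vector(doc_keywords, PL):
    """Definition 5: the tau-dimensional DFB-vector of a document."""
    return [1 if doc_keywords & P else 0 for P in PL]

def qfb_vector(query_keywords, PL):
    """Definition 6: the tau-dimensional QFB-vector of a query."""
    return [1 if query_keywords & P else 0 for P in PL]

def candidate_ids(VF_D, VF_Q):
    """Observation 3: a document is a candidate iff its DFB-vector ANDed with the QFB-vector is non-zero."""
    return [doc_id for doc_id, vf in VF_D.items()
            if any(a & b for a, b in zip(vf, VF_Q))]

PL = [{"cloud", "storage"}, {"privacy", "encryption"}, {"ranking", "search"}]
docs = {"d1": {"cloud", "privacy"}, "d2": {"ranking"}, "d3": {"storage"}}

VF_D = {doc_id: dfb_vector(kws, PL) for doc_id, kws in docs.items()}
VF_Q = qfb_vector({"privacy", "search"}, PL)
print(candidate_ids(VF_D, VF_Q))   # ['d1', 'd2']; d3 is filtered out
```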
MRSE-HC Scheme
Based on the keyword partition vector model, we give the detailed procedures of our scheme. We first give the framework of MRSE-HC shown in FIGURE 4. It consists of two stages: the setup stage and the search stage. Before providing ranked search services, DO performs the algorithms $GenKey$, $GenPartitions$, $GenVectors$, $EncData$ and $StoreData$ in the setup stage; in the search stage, DU, Pri-Cloud and Pub-Cloud cooperate to process ranked searches.
A. Algorithms in Setup Stage
1) $SK \leftarrow GenKey(1^{l(n)})$
DO generates the secret key $SK$ with the security parameter $l(n)$. $SK$ is used to encrypt the documents and the document vectors and is shared only between DO and DU.
2) $PL \leftarrow GenPartitions(W, \tau)$
This algorithm is given in Algorithm 1 in detail. It is based on the bisecting $k$-means clustering and generates the balanced keyword partition list $P\!L$ from the keyword dictionary $W$.
3) $\{V_{D}, VF_{D}\} \leftarrow GenVectors(D, PL)$
For each document $d_{i} \in D$, DO generates the document vector $V_{d_{i}}$ according to Eq. (1) and the DFB-vector $V\!F_{d_{i}}$ according to Definition 5 on the basis of the keyword partition list $P\!L$.
4) $\{\tilde{D}, \tilde{V}_{D} \} \leftarrow EncData(D, V_{D}, SK)$
For each document $d_{i} \in D$, DO encrypts $d_{i}$ into $\widetilde {d}_{i}$ and encrypts the document vector $V_{d_{i}}$ into $\widetilde {V}_{d_{i}}$ with the secret key $SK$.
5) $StoreData(VF_{D}, \tilde{D}, \tilde{V}_{D})$
After the above steps, DO outsources the encrypted documents $\widetilde {D}$ and the encrypted document vectors $\widetilde {V}_{D}$ to Pub-Cloud, and stores the DFB-vectors $V\!F_{D}$ in Pri-Cloud as the index.
After the five steps, the setup stage is finished and the system is prepared for the multi-keyword search over encrypted documents.
B. Algorithms in Search Stage
1) $\tilde{V}_{Q} \leftarrow GenTrapdoor(Q, SK)$
Once a query $Q$ is issued, DU generates the query vector $V_{Q}$ according to Eq. (2) and encrypts it into the trapdoor $\widetilde {V}_{Q}$ with the secret key $SK$. The trapdoor is then submitted to Pub-Cloud.
2) $VF_{Q} \leftarrow GenQFBVector(Q, PL)$
According to Definition 6, DU generates the QFB-vector $V\!F_{Q}$ of $Q$ on the basis of the keyword partition list $P\!L$ and submits it to Pri-Cloud.
3) $CID \leftarrow Filtering(VF_{Q}, VF_{D})$
Pri-Cloud utilizes the DFB-vectors as the index to filter out the candidate documents for the query. It performs the bitwise AND operation between each DFB-vector and the received QFB-vector and finds the IDs of the candidate documents for the query according to Observation 3. The resulting candidate document ID set $C\!I\!D$ is then transmitted to Pub-Cloud.
4) $\Re \leftarrow Searching(\tilde{D}, \tilde{V}_{D}, \tilde{V}_{Q}, CID, k)$
Pub-Cloud computes the inner products between the trapdoor $\widetilde {V}_{Q}$ and the encrypted document vectors $\widetilde {V}_{d_{i}}$ whose IDs are in $C\!I\!D$ as the relevance scores, and returns the $k$ encrypted documents with the highest scores, denoted as $\Re$, to DU. The correctness of computing relevance scores over the encrypted vectors is guaranteed by Lemma 1.
At last, when DU receives the result encrypted documents $\Re$ from Pub-Cloud, it decrypts them with the secret key and obtains the plaintext result.
Lemma 1: The inner product of the trapdoor and an encrypted document vector equals the inner product of the corresponding plaintext vectors, i.e., $\widetilde {V}_{Q} \cdot \widetilde {V}_{d_{i}} = V_{Q} \cdot V_{d_{i}}$.
Proof:
\begin{align*} \widetilde {V}_{Q} \cdot \widetilde {V}_{d_{i}}=&\{V_{Q}^{1}\!M_{1}^{-1},V_{Q}^{2}\!M_{2}^{-1}\}\cdot \{V_{d_{i}}^{1}\!M_{1}^{T},V_{d_{i}}^{2}\!M_{2}^{T}\}^{T} \\=&(V_{Q}^{1}\!M_{1}^{-1})\cdot (V_{d_{i}}^{1}\!M_{1}^{T})^{T} + (V_{Q}^{2}\!M_{2}^{-1})\cdot (V_{d_{i}}^{2}\!M_{2}^{T})^{T}\\=&V_{Q}^{1}\!M_{1}^{-1}\!M_{1}(V_{d_{i}}^{1})^{T} +V_{Q}^{2}\!M_{2}^{-1}\!M_{2}(V_{d_{i}}^{2})^{T}\\=&V_{Q}^{1}(V_{d_{i}}^{1})^{T}+V_{Q}^{2}(V_{d_{i}}^{2})^{T}\\=&V_{Q} \cdot V_{d_{i}}\end{align*}
In the above algorithms, DU transforms the query request into two vectors: one is the trapdoor (the encrypted query vector) generated by $GenTrapdoor$ and submitted to Pub-Cloud, and the other is the QFB-vector generated by $GenQFBVector$ and submitted to Pri-Cloud. Pri-Cloud filters out the candidate document IDs with the QFB-vector, and Pub-Cloud determines the ranked result with the trapdoor.
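The following numpy sketch checks the equality in Lemma 1. It assumes the standard secure kNN splitting rule of [20] with a random binary splitting indicator S (the splitting details are not spelled out above, so this is an assumption of the sketch); the document parts are encrypted with M1^T and M2^T, and the query parts with the inverses, exactly as in the proof.

```python
import numpy as np

n = 6
rng = np.random.default_rng(1)

# Secret key: a random binary splitting indicator S and two random invertible matrices M1, M2.
S = rng.integers(0, 2, n)
M1 = rng.standard_normal((n, n))
M2 = rng.standard_normal((n, n))

V_d = rng.random(n)   # plaintext document vector
V_Q = rng.random(n)   # plaintext query vector

# Secure kNN split: the document vector is randomly split where S == 1 and copied where S == 0;
# the query vector is split the opposite way.
r_d, r_q = rng.random(n), rng.random(n)
Vd1 = np.where(S == 1, r_d, V_d)
Vd2 = np.where(S == 1, V_d - r_d, V_d)
VQ1 = np.where(S == 0, r_q, V_Q)
VQ2 = np.where(S == 0, V_Q - r_q, V_Q)

# Encrypt as in Lemma 1.
enc_d = (Vd1 @ M1.T, Vd2 @ M2.T)
enc_Q = (VQ1 @ np.linalg.inv(M1), VQ2 @ np.linalg.inv(M2))

# The score computed over the encrypted pair equals the plaintext inner product.
assert np.isclose(enc_Q[0] @ enc_d[0] + enc_Q[1] @ enc_d[1], V_Q @ V_d)
```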
The Enhanced Scheme
In this section, we propose the enhanced scheme EMRSE-HC, which is designed to improve the efficiency of MRSE-HC. EMRSE-HC utilizes the complete binary pruning tree (CBP-Tree) instead of the sequential DFB-vectors, and the corresponding CBP-Tree based filtering algorithm is adopted. Compared with MRSE-HC, EMRSE-HC adds the CBP-Tree construction algorithm $\mathcal {I}\leftarrow BuildCBPT(D)$ and modifies the $StoreData$ and $Filtering$ algorithms as follows.
$\mathcal {I}\leftarrow BuildCBPT(D)$. In the EMRSE-HC scheme, the CBP-Tree, denoted as $\mathcal {I}$, is constructed as the index. Each node in $\mathcal {I}$ stores a DFB-vector and a pruning vector, the latter of which is used for improving the search efficiency. Details of this algorithm are given in Section VII.A.
$StoreData(\mathcal {I},\widetilde {D},\widetilde {V}_{D})$. In the EMRSE-HC scheme, $\mathcal {I}$ is stored in Pri-Cloud, and the storage of $\widetilde {D}$ and $\widetilde {V}_{D}$ is the same as in MRSE-HC.
$C\!I\!D\leftarrow Filtering(\mathcal {I}, i, V\!F_{Q})$. In the EMRSE-HC scheme, the $Filtering$ algorithm utilizes the CBP-Tree to prune unqualified documents efficiently and generates the candidate documents. Details of the updated $Filtering$ algorithm are given in Section VII.B.
A. CBP-Tree Construction Algorithm
The CBP-Tree is the index of EMRSE-HC, which is a complete binary tree. Each node corresponds to a document. The data structure of a node is defined as \begin{equation*} < docI\!D,df\!v,pv>,\tag{14}\end{equation*} where $docI\!D$ is the identity of the corresponding document, $df\!v$ is the DFB-vector of that document, and $pv$ is the pruning vector of the node.
According to [32], an array is an appropriate structure to store a complete binary tree. Thus, we utilize an array to store the CBP-Tree, where the parent-child relationships of the nodes are determined by their positions in the array.
Algorithm 2 $BuildC\!B\!P\!T(D,\mathcal{I})$
Input: The document collection $D$ (with the DFB-vectors of its documents).
Output: The CBP-Tree $\mathcal{I}$.
for
Settle the node
end for
for
if
else
end if
end for
return
In Algorithm 2, if the node is a leaf node, its pruning vector is its own DFB-vector; otherwise, its pruning vector is the bitwise OR of its own DFB-vector and the pruning vectors of its children. Hence the pruning vector of a node covers the involved partitions of all documents in the subtree rooted at that node.
We take the same example as Example 1 to illustrate Algorithm 2. The DFB-vectors of the documents in Example 1 are shown in FIGURE 4. Taking those DFB-vectors as the input of Algorithm 2, the constructed CBP-Tree is shown in FIGURE 5.
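A sketch of the CBP-Tree construction is given below, under two assumptions consistent with the description above: the tree is stored in an array where the children of the node at index i sit at indices 2i+1 and 2i+2, and the pruning vector of a node is the bitwise OR of its own DFB-vector and its children's pruning vectors (so a leaf's pruning vector is just its DFB-vector).

```python
def build_cbpt(doc_ids, VF_D):
    """Build the CBP-Tree as an array of nodes <docID, dfv, pv> (a sketch of Algorithm 2).
    doc_ids: list of document ids; VF_D: dict mapping a document id to its DFB-vector."""
    m = len(doc_ids)
    tree = [{"docID": d, "dfv": VF_D[d], "pv": list(VF_D[d])} for d in doc_ids]
    # Compute pruning vectors bottom-up: pv(i) = dfv(i) OR pv(2i+1) OR pv(2i+2).
    for i in range(m - 1, -1, -1):
        for child in (2 * i + 1, 2 * i + 2):
            if child < m:
                tree[i]["pv"] = [a | b for a, b in zip(tree[i]["pv"], tree[child]["pv"])]
    return tree
```

With this layout, the pruning vector of any node summarizes the involved partitions of every document in its subtree, which is what the filtering algorithm in the next subsection relies on.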
B. CBP-Tree Based Filtering Algorithm
DO stores the CBP-Tree in Pri-Cloud. Pri-Cloud utilizes the pruning vectors of the nodes in the CBP-Tree to prune unqualified subtrees. The CBP-Tree based filtering algorithm is shown in Algorithm 3.
Algorithm 3 ${Filtering(\mathcal{I}, i, V\!F_{Q}, C\!I\!D)}$
if
if
if
Add
end if
else
if
if
Add
end if
end if
end if
end if
Algorithm 3 is a recursive algorithm that starts from the root node. It finds all candidate documents, i.e., the nodes whose DFB-vectors yield a non-zero result when bitwise ANDed with the QFB-vector. The DFB-vector of each node is used to decide whether the node's document is a candidate, while the pruning vector is used to improve the filtering efficiency: if the bitwise AND of a node's pruning vector and the QFB-vector is the all-zero vector, the whole subtree rooted at that node is pruned. Because the height of the CBP-Tree is $O(\log m)$, a large number of unqualified documents can be pruned with few bitwise operations.
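A recursive sketch matching this description (same assumptions as the construction sketch above): the pruning vector decides whether a subtree is visited at all, and the node's own DFB-vector decides whether its docID is added to CID.

```python
def filtering(tree, i, VF_Q, CID):
    """CBP-Tree based filtering (a sketch of Algorithm 3), called as filtering(tree, 0, VF_Q, [])."""
    if i >= len(tree):
        return
    node = tree[i]
    # Prune the whole subtree if none of its documents involves a target partition.
    if not any(a & b for a, b in zip(node["pv"], VF_Q)):
        return
    # The node's own document is a candidate if its DFB-vector matches the QFB-vector.
    if any(a & b for a, b in zip(node["dfv"], VF_Q)):
        CID.append(node["docID"])
    filtering(tree, 2 * i + 1, VF_Q, CID)
    filtering(tree, 2 * i + 2, VF_Q, CID)
```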
The enhanced $Filtering$ algorithm replaces the sequential scan over all DFB-vectors in MRSE-HC and thus further improves the filtering efficiency.
C. Security Enhancement
In the search stage, different trapdoors are generated when the same query is applied, but the candidate documents and the calculated relevance scores remain the same. Pub-Cloud could use these covert channels to link identical search requests and deduce the hot keywords with high frequency in documents. To overcome this problem, a practical and effective countermeasure is to extend the vector dimensions by adding some phantom terms, which breaks such covert channels [5], [6], [11], [13]. We also introduce this method to enhance the security of our scheme and to protect the document confidentiality, the index and trapdoor privacy, and the trapdoor unlinkability.
We adopt a phantom term addition method similar to [5], [6], [11], [13] to provide trapdoor unlinkability. The brief idea is as follows. First, DO randomly generates two invertible matrices whose dimensions are extended from $n$ to accommodate $u$ phantom dimensions. The phantom entries of each document vector are set to random numbers that follow a normal distribution with standard deviation $\sigma$, and a random subset of the phantom entries of each query vector is set to 1. As a result, the same query produces different trapdoors and different relevance scores, while the ranking order is preserved with high probability.
Adding the phantom terms to the document vectors and the query vectors affects the accuracy of the relevance scores and hence of the query results, but the trapdoor unlinkability is preserved. In addition, the standard deviation parameter $\sigma$ can be tuned to balance the trade-off between search accuracy and privacy.
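The sketch below only illustrates the effect of the phantom terms on the scores, under the assumptions stated above (u phantom dimensions, document phantom entries drawn from a normal distribution with standard deviation sigma, a random subset of query phantom entries set to 1); in the full scheme these extended vectors are additionally split and encrypted as in Lemma 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, u, sigma = 8, 4, 0.05

V_d = rng.random(n)                                   # original document vector
V_Q = rng.random(n)                                   # original query vector

# Extend the document vector with u phantom entries drawn from N(0, sigma^2).
V_d_ext = np.concatenate([V_d, rng.normal(0.0, sigma, u)])

def extend_query(V_Q):
    """Set a random subset of the u phantom entries to 1, so repeated identical
    queries yield slightly different relevance scores (trapdoor unlinkability)."""
    phantom = (rng.random(u) < 0.5).astype(float)
    return np.concatenate([V_Q, phantom])

s1 = V_d_ext @ extend_query(V_Q)
s2 = V_d_ext @ extend_query(V_Q)
print(s1, s2, V_d @ V_Q)   # the two scores differ slightly; sigma tunes the deviation
```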
D. Security Analysis
In this section, we analyze the EMRSE-HC scheme according to the three privacy demands.
1) Document Confidentiality
In the scheme, the documents and the corresponding vectors are encrypted and then outsourced to Pub-Cloud. The encryption procedures are given in the $EncData$ algorithm. Without the secret key $SK$, Pub-Cloud cannot decrypt the documents or the document vectors. Therefore, the document confidentiality is protected.
2) Index and Trapdoor Privacy
In the scheme, the index is the CBP-Tree stored in Pri-Cloud which is isolated from Pub-Cloud. Trapdoors are generated by performing secure inner product operation on the query vectors with the secret key SK which is only shared by DO and DU. Additionally, several random items are added in the trapdoor generation process, which increases the randomness of values in vectors. Without the secret key and the phantom item addition method, it is hard for Pub-Cloud to get the plaintext query vectors. As a result, the scheme can protect the privacy of the index and the trapdoor.
3) Trapdoor Unlinkability
In this scheme, random phantom terms are introduced in the trapdoor generation process, so the probability that the same query generates the same trapdoor is negligible. As a result, Pub-Cloud cannot link different search requests issued for the same query, and the trapdoor unlinkability is protected.
Performance Evaluation
In this section, we evaluate the performance of our proposed basic scheme MRSE-HC, efficiency enhanced scheme EMRSE-HC and compare them with the scheme presented in [26] which is denoted as FMRS. We implement MRSE-HC, EMRSE-HC and FMRS and perform the evaluations on the search time cost on the real data set of NSF Research Award Abstracts provided by UCI [33]. The real dataset includes about 129000 abstracts. We use IK Analyzer [34] to extract the keywords of documents and then process the extracted keywords.
The experimental hardware environment is an INTEL Core(TM) i5-8250 CPU, 4 GB memory, and a 588 GB hard disk; the software environment is the Eclipse development platform. Since the queried keywords are usually relevant to each other in practice, we assume that the default number of target partitions of a query is one. Other default parameters are summarized in Table 1, where $\tau$ is the number of partitions, $k$ the number of requested documents, $t$ the number of queried keywords, $m$ the number of documents and $n$ the size of the keyword dictionary.
In the following experiments, we evaluate the time cost of searches where one of the above parameters changes and the other parameters adopt the default values. The results are shown in FIGURE 6–10.
FIGURE 6–10 all show that the proposed MRSE-HC and EMRSE-HC both outperform FMRS in the time cost of ranked searches, and EMRSE-HC is the most efficient. On average, EMRSE-HC saves about 20% and 80% of the time cost compared with MRSE-HC and FMRS, respectively. The reasons are as follows. (1) The target partitions in MRSE-HC and EMRSE-HC are usually fewer than in FMRS when a query is applied, because the partitions generated in MRSE-HC and EMRSE-HC are clustered according to the relevance between keywords and the queried keywords are usually relevant to each other. The fewer the target partitions, the fewer candidate documents are determined, which improves the search efficiency. (2) In EMRSE-HC, the tree index, the CBP-Tree, is adopted and a large number of irrelevant documents are pruned, which further improves the search efficiency.
We analyze the impacts of the changes of these parameters one by one as follows.
FIGURE 6 indicates that, as $\tau$ grows, the time cost of MRSE-HC, EMRSE-HC and FMRS all decrease. The reason is that, when $\tau$ grows, more partitions are generated, and both the keywords in each partition and the covered documents of each partition decrease. Since the number of target partitions stays constant, the total covered documents of the target partitions, which are the candidate documents, decrease simultaneously. Thus, the time cost of all schemes decreases.
FIGURE 7 shows that the increase of the number of requested documents $k$ has little impact on the time cost of MRSE-HC, EMRSE-HC and FMRS. The reason is that the search conditions (such as partitions, vectors and documents) remain unchanged as $k$ grows, so the candidate documents stay the same and the time cost just oscillates randomly around a certain level.
FIGURE 8 indicates that, as the number of queried keywords $t$ grows, the time cost of FMRS increases while the time cost of MRSE-HC and EMRSE-HC only shows random oscillations. The reason is that keywords in FMRS are equally divided and randomly distributed over the partitions, while keywords in MRSE-HC and EMRSE-HC are clustered by relevance into partitions. When $t$ grows, the number of target partitions increases proportionally in FMRS, but changes only slightly in MRSE-HC and EMRSE-HC. Since the candidate documents are proportional to the target partitions, the time cost of FMRS increases while the time cost of MRSE-HC and EMRSE-HC changes only slightly.
FIGURE 9 indicates that the time cost of MRSE-HC, EMRSE-HC and FMRS all increase as $m$ grows. The reason is that, when the number of documents $m$ increases, the average number of covered documents of each partition grows simultaneously. Since the target partitions do not change, the total covered documents of the target partitions increase. Thus, the candidate documents increase, which consumes more time in the ranked searches.
FIGURE 10 shows that, as $n$ grows, the time cost of MRSE-HC, EMRSE-HC and FMRS all increase. The reason is that the average number of keywords in each partition grows as the dictionary capacity $n$ increases, which causes the covered documents of each partition to increase. Since the target partitions do not change, the candidate documents (the covered documents of the target partitions) increase simultaneously. In addition, the dimensions of the document vectors and query vectors are enlarged when $n$ grows, which consumes more time in the relevance score calculations. Thus, all schemes consume more time to accomplish the ranked searches.
FIGURE 11 shows that the space cost of the indexes in MRSE-HC, EMRSE-HC and FMRS all increase as $\tau$ grows. The reason is that, when $\tau$ grows, more partitions are generated and the dimensions of the document filtering bit vectors increase simultaneously. The space costs of the indexes in MRSE-HC and FMRS are the same, because the keyword dictionaries in MRSE-HC and FMRS are identical and thus the vectors in their indexes have the same size. In addition, the space cost of the index in EMRSE-HC is about twice that of MRSE-HC, because the index in EMRSE-HC is a complete binary tree whose total number of nodes is nearly twice the number of leaf nodes; the leaf nodes store the same vectors as MRSE-HC and the internal nodes store the same structure of vectors as the leaf nodes.
Conclusion
It is still a challenge to ensure the efficiency of searches under the premise of ensuring the accuracy of the search results, and most existing multi-keyword ranked search schemes over encrypted data target the public cloud. In this paper, we propose a privacy-preserving multi-keyword ranked search scheme over encrypted data in hybrid clouds, denoted as MRSE-HC. In our scheme, the keyword dictionary of documents is clustered into balanced partitions by the keyword partition algorithm based on bisecting $k$-means clustering, and the keyword partition based bit vectors are adopted as the index for searches. The private cloud filters out the candidate documents with the bit vectors, and then the public cloud determines the ranked result among the candidates. We further propose the enhanced scheme EMRSE-HC, which adds the complete binary pruning tree to improve the filtering efficiency. The security analysis and performance evaluation show that our schemes are privacy-preserving and outperform the existing scheme FMRS in terms of search efficiency.
ACKNOWLEDGMENT
This paper is an extension of the preliminary version [1].