
A Privacy-Preserving Multi-Keyword Ranked Search Over Encrypted Data in Hybrid Clouds




Abstract:

With the rapid development of cloud computing services, more and more individuals and enterprises prefer to outsource their data or computing to clouds. To preserve data privacy, the data should be encrypted before outsourcing, which makes it a challenge to perform searches over the encrypted data. In this paper, we propose a privacy-preserving multi-keyword ranked search scheme over encrypted data in hybrid clouds, denoted as MRSE-HC. The keyword dictionary of the documents is clustered into balanced partitions by a bisecting $k$-means clustering based keyword partition algorithm. According to the partitions, keyword partition based bit vectors are adopted for documents and queries, which serve as the index for searches. The private cloud filters out the candidate documents by the keyword partition based bit vectors, and then the public cloud uses the trapdoor to determine the result among the candidates. On the basis of the MRSE-HC scheme, an enhanced scheme EMRSE-HC is proposed, which adds a complete binary pruning tree to further improve search efficiency. The security analysis and performance evaluation show that MRSE-HC and EMRSE-HC are privacy-preserving multi-keyword ranked search schemes for hybrid clouds and outperform the existing scheme FMRS in terms of search efficiency.
Topic: Emerging Approaches to Cyber Security
Published in: IEEE Access ( Volume: 8)
Page(s): 4895 - 4907
Date of Publication: 31 December 2019
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Nowadays, cloud computing is considered a rapidly developing and popular model of distributed computing and storage, offering high-quality data storage, quick and convenient computing, and on-demand service. Outsourcing services built on the cloud can effectively reduce the cost to enterprises of purchasing and maintaining hardware and software and managing data. Attracted by these appealing features of convenience, economy and high scalability, more and more individuals and enterprises are motivated to outsource their data or computing to the cloud. However, in the outsourcing cloud, the Data Owner (DO) is unable to directly control and manage the data stored on the Cloud Server (CS); thus, DO cannot be certain whether the data are protected and whether they are used and computed legally and reasonably, so data privacy is seriously threatened. At present, privacy protection in the outsourcing cloud has become a major obstacle impeding its further development [2].

In the outsourcing cloud, a naive scheme to protect data confidentiality is to encrypt the data before outsourcing them to the cloud. However, encrypted data cannot be directly searched and used. When the scale of the data is small, DO can download all the data to a local computer, decrypt them, and obtain the needed information from the plaintext. But in today's increasingly popular Big Data applications, this method incurs huge time and bandwidth costs to acquire the needed information and is therefore impractical. Hence, it is a challenge to perform privacy-preserving ranked search over encrypted cloud data.

In this paper, we propose a privacy-preserving multi-keyword ranked search scheme over encrypted data in hybrid clouds. The keyword partition vector model is presented, in which the keywords of the documents are clustered by a bisecting $k$-means clustering algorithm into multiple balanced partitions, so that the keywords within a partition have high relevance scores with each other. The relevance score is calculated by the Normalized Google-Distance [3]. According to the generated partitions, the document filtering bit vectors (DFB-vectors) and the query filtering bit vector (QFB-vector) are defined for documents and queries respectively. The former serve as the index for performing efficient searches while the latter is used as the query command. The proposed scheme has two main stages, the setup stage and the search stage. In the setup stage, the keyword partitions are first clustered, and then the DFB-vectors are created and deployed in Pri-Cloud. The documents and the corresponding vectors are encrypted and outsourced to Pub-Cloud. In the search stage, when a query with multiple keywords is issued, the corresponding QFB-vector and trapdoor are generated and submitted to Pri-Cloud and Pub-Cloud respectively. Pri-Cloud uses the QFB-vector and the DFB-vectors to filter out the candidate documents corresponding to the target partitions of the query, and then gives the IDs of the candidate documents to Pub-Cloud. After that, Pub-Cloud uses the trapdoor and the candidate IDs to determine the resulting encrypted documents. Because the keywords in a partition are highly relevant to each other, and the keywords of a query are usually relevant to each other in practice, the number of target partitions is small and the candidate documents are correspondingly few. Therefore, the proposed scheme is efficient, as indicated by the performance evaluation results.
To improve the search efficiency, an enhanced scheme EMRSE-HC is proposed, which introduces a complete binary tree-based index structure and an optimized filtering algorithm.

The contributions of this paper are as follows.

  1. We propose a keyword partition vector model. In this model, a bisecting $k$-means clustering based keyword partition algorithm is proposed, which generates the balanced keyword partitions and the keyword partition based bit vectors (DFB-vectors and QFB-vectors). The DFB-vectors serve as the index for searches.

  2. On the basis of the keyword partition vector model and the complete binary tree structure, we propose an efficient ranked search scheme over encrypted data in hybrid clouds. The private cloud filters out the candidate documents, and then the public cloud determines the result.

  3. We analyze the security of the proposed scheme and evaluate its search performance. The result shows that the proposed scheme is a privacy-preserving multi-keyword ranked search scheme for hybrid clouds and outperforms the existing scheme FMRS in terms of search efficiency.

SECTION II.

Related Work

To support the multi-keyword search over the outsourced encrypted cloud data, researchers have proposed many Searchable Encryption (SE) schemes [4]–​[18].

Song et al. [4] proposed the first symmetric searchable encryption (SSE) scheme. Cao et al. [5], [6] proposed the first multi-keyword ranked search scheme. The vector space model (VSM) [19] and secure KNN [20] are adopted to achieve the privacy-preserving ranked searches. Xu et al. [7] proposed a two-step-ranking search scheme over encrypted cloud data which adopts the order-preserving encryption (OPE) [21], [22]. Yang et al. [8] proposed a fast privacy-preserving multi-keyword search scheme. It supports dynamic updates on documents. Li et al. [9], [10] proposed a fine-grained multi-keyword search scheme over encrypted cloud data. However, only boolean queries are supported. Xia et al. [11] proposed a secure and dynamic multi-keyword ranked search scheme by adopting a balanced binary tree index. Chen et al. [13] and Zhu et al. [12] proposed two different privacy-preserving ranked search schemes, which both utilize clustering algorithm to improve search efficiency.

Fu et al. [15] and Wang et al. [14] proposed multi-keyword fuzzy search schemes over encrypted outsourced data. To achieve fuzzy search, the locality-sensitive hashing functions [23], wordnet and secure KNN are adopted. Wang et al. [16] presented a multi-keyword fuzzy search scheme which supports range queries by adopting the locality-sensitive hashing functions, bloom filtering [24] and order-preserving encryption. Fu et al. [17] proposed a synonym expansion of document keywords and realized the synonym-based multi-keyword ranked search scheme. Xia et al. [18] proposed a multi-keyword semantic ranked search scheme where the inverted index for documents and the semantic relationship library for keywords are adopted. Fu et al. [25] proposed a different semantic-aware ranked search scheme which adopts the concept hierarchy and the semantic relationship between concepts.

According to the state of the art, most existing works focus on public clouds. Only Yang et al. [26] proposed a search scheme for hybrid clouds, which consist of the public cloud (Pub-Cloud) and the private cloud (Pri-Cloud). In their scheme, Pri-Cloud is assumed to be trusted while Pub-Cloud is assumed to be honest-but-curious. The keywords of documents are equally divided into multiple partitions and a document index vector is created for each document according to the partitions. Pri-Cloud utilizes the document index vectors to obtain the candidate document identities and then Pub-Cloud determines the resulting encrypted documents among those candidates. The more partitions the searched keywords cover, the more candidate document identities Pri-Cloud obtains, so the search cost is proportional to the number of partitions covering the queried keywords. In practice, the searched keywords are usually relevant to each other. For example, basketball, NBA and slam dunk could be the queried keywords for retrieving news of interest, and they are obviously relevant. Therefore, if keywords with high relevance are gathered in fewer partitions, the search efficiency will be improved when the searched keywords are relevant.

SECTION III.

Notations and Preliminaries

A. Notations

  • $d_{i}$ — A plaintext document.

  • $D$ — A plaintext document collection, $D=\{d_{1},d_{2},\ldots,d_{m}\}$ .

  • $V_{d_{i}}$ — The $n$ -dimensional document vector of $d_{i}$ .

  • $V_{D} $ — The set of document vectors of documents in $D$ , $V_{D}=\{V_{d_{1}},V_{d_{2}},\ldots,V_{d_{m}} \}$ .

  • $\widetilde {d}_{i} $ — The encrypted document of $d_{i}$ .

  • $\widetilde {D}$ — The encrypted document collection of $D$ , $\widetilde {D}=\{\widetilde {d}_{1},\widetilde {d}_{2},\ldots,\widetilde {d}_{m}\}$ .

  • $\widetilde {V}_{d_{i}}$ — The encrypted $n$ -dimensional document vector of $d_{i}$ .

  • $\widetilde {V}_{D}$ — The set of encrypted documents vectors, $\widetilde {V}_{D}=\{\widetilde {V}_{d_{1}},\widetilde {V}_{d_{2}},\ldots,\widetilde {V}_{d_{m}}\}$ .

  • $W$ — A keyword dictionary having $n$ keywords, $W=\{w_{1},w_{2},\ldots,w_{n}\}$ .

  • $P\!L$ — A list of keyword partitions, $P\!L=\{P_{1}, P_{2}, \ldots,\,\, P_\tau \}$ .

  • $V\!F_{d_{i}}$ — The $\tau $ -dimensional DFB-vector of $d_{i}$ .

  • $V\!F_{D}$ — The set of DFB-vectors of the documents, $V\!F_{D}=\{V\!F_{d_{1}},V\!F_{d_{2}},\ldots,V\!F_{d_{m}}\}$ .

  • $Q$ — A query request with multi-keywords.

  • $V_{Q}$ — The $n$ -dimensional query vector of $Q$ .

  • $\widetilde {V}_{Q}$ — The trapdoor of $Q$ which is the encrypted $n$ -dimensional query vector.

  • $V\!F_{Q}$ — The $\tau $ -dimensional QFB-vector of $Q$ .

  • $C\!I\!D$ — A set of candidate document IDs for the query $Q$ .

B. Preliminaries

1) Vector Space Model

The vector space model [19] with the TF-IDF model [27] is widely adopted in secure multi-keyword search [5], [6], [11]–​[13], and we also use these models in this paper. TF denotes the term frequency and IDF the inverse document frequency. The former is the number of times a given keyword or term occurs in a document, while the latter is calculated by dividing the total number of documents by the number of documents containing the given keyword or term. Each document $d_{i}$ is described by an $n$-dimensional vector where $n$ is the size of the keyword dictionary. $V_{d_{i}}[j]$ stores the normalized TF value of the keyword $w_{j}$ as shown in Eq. (1). For a query $Q$ with multiple searched keywords, the $n$-dimensional vector $V_{Q}$ stores the normalized IDF values of the searched keywords in $Q$ . The calculation of $V_{Q}[j]$ is shown in Eq. (2).\begin{align*} V_{d_{i}}[j]=&T\!F_{d_{i},w_{j}}/\sqrt {\sum _{w_{j} \in d_{i}}(T\!F_{d_{i},w_{j}})^{2}}\tag{1}\\ V_{Q}[j]=&I\!D\!F_{w_{j}}/\sqrt {\sum _{w_{j}\in Q}(I\!D\!F_{w_{j}})^{2}}\tag{2}\end{align*}
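As a toy illustration of Eqs. (1)–(3), the sketch below builds the normalized TF document vectors and a normalized IDF query vector for a hypothetical collection of 3 documents over a 4-keyword dictionary (all counts are made up for the example):

```python
import math

# Hypothetical toy data: tf[i][j] is the raw term frequency of
# keyword j in document i.
tf = [
    [3, 0, 1, 0],
    [0, 2, 2, 1],
    [1, 1, 0, 4],
]
n, m = 4, len(tf)  # dictionary size, number of documents

def doc_vector(tf_row):
    """Normalized TF vector of one document, Eq. (1)."""
    norm = math.sqrt(sum(t * t for t in tf_row))
    return [t / norm for t in tf_row]

def idf(j):
    """IDF of keyword j: total documents / documents containing j."""
    return m / sum(1 for row in tf if row[j] > 0)

def query_vector(keywords):
    """Normalized IDF query vector, Eq. (2); keywords is a set of indices."""
    raw = [idf(j) if j in keywords else 0.0 for j in range(n)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

V_d = [doc_vector(row) for row in tf]
V_Q = query_vector({0, 2})
# Relevance score of Eq. (3): the inner product of the two vectors.
score = sum(a * b for a, b in zip(V_d[0], V_Q))
```

Both vectors are unit-length, so the score is simply the cosine-style inner product used in Eq. (3).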

2) Relevance Score Measurement

We adopt the same calculation as [13] to measure the relevance scores between documents and search requests in this paper. Assume that $d_{i}$ is a document and $Q$ is a query request with multiple searched keywords; the relevance score between $d_{i}$ and $Q$ is calculated as the inner product of the corresponding document vector $V_{d_{i}}$ and the query vector $V_{Q}$ , i.e.\begin{equation*}score(V_{d_{i}},V_{Q})=V_{d_{i}}\cdot V_{Q} =\sum _{j=1}^{n}V_{d_{i}}[j]\times V_{Q}[j]\tag{3}\end{equation*}

3) Secure Inner Product Operation

The secure inner product operation [20] is adopted in this paper. The operation is capable of computing the inner product of two encrypted vectors even when their plaintext values are unknown. Assume that $p$ and $q$ are two $n$-dimensional vectors and $M$ is a random $n\times n$ invertible matrix, used as the secret key. We denote by $\widetilde {p}$ and $\widetilde {q}$ the encrypted forms of the plaintext vectors $p$ and $q$ respectively, calculated as $\widetilde {p}=pM^{-1} $ and $\widetilde {q}=qM^{T}$ . Then we have \begin{align*} \widetilde {p}\cdot \widetilde {q}=&(pM^{-1})\cdot (qM^{T}) \\=&pM^{-1}(qM^{T})^{T} \\=&pM^{-1}Mq^{T} \\=&p\cdot q\tag{4}\end{align*}

Therefore, $\widetilde {p}\cdot \widetilde {q}=p\cdot q$ holds which indicates that the inner product of two encrypted vectors equals the inner product of the corresponding plaintext vectors.
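Eq. (4) can be checked numerically. The sketch below uses a hypothetical fixed 2×2 key matrix whose inverse has integer entries (the scheme itself draws $M$ at random):

```python
# Hypothetical key matrix M with det = 1, so M^{-1} is exact.
M     = [[2, 1], [1, 1]]    # secret key matrix
M_inv = [[1, -1], [-1, 2]]  # M^{-1}
M_T   = [[2, 1], [1, 1]]    # M^T (this M happens to be symmetric)

def vec_mat(v, A):
    """Row vector v times matrix A."""
    return [sum(v[i] * A[i][j] for i in range(len(v))) for j in range(len(A[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

p, q = [3, 5], [2, 7]
p_enc = vec_mat(p, M_inv)   # encrypted p = p M^{-1}
q_enc = vec_mat(q, M_T)     # encrypted q = q M^T
assert dot(p_enc, q_enc) == dot(p, q)  # inner product is preserved
```

The encrypted vectors themselves ([-2, 7] and [11, 9] here) bear no obvious relation to the plaintexts, yet their inner product is unchanged.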

4) Normalized Google-Distance

Given two keywords $w_{i}$ and $w_{j}$ , the Normalized Google-Distance [3] between them is denoted as $Dist(w_{i}, w_{j})$ where \begin{equation*} Dist(w_{i},w_{j})= \frac {max\{log_{2}F_{i},log_{2}F_{j}\}-log_{2}F_{i,j}} {log_{2}m-min\{log_{2}F_{i},log_{2}F_{j} \}}\tag{5}\end{equation*}

In Eq. (5), $F_{i}$ and $F_{j}$ are the total frequencies of $w_{i}$ and $w_{j}$ appearing in the documents of $D$ respectively, $F_{i,j}$ is the number of documents in which $w_{i}$ and $w_{j}$ both appear, $m$ is the number of documents of $D$ , and $min\{X\}$ returns the minimum of the set $X$ . According to [3], this distance can be used to represent the relevance between keywords: the relevance between $w_{i}$ and $w_{j}$ increases as $Dist(w_{i}, w_{j})$ decreases. In this paper the Normalized Google-Distance is used only for calculating the relevance between keywords.
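Under the definitions above, Eq. (5) can be computed directly; the counts below are hypothetical, chosen only to show that frequent co-occurrence shrinks the distance:

```python
import math

def ngd(F_i, F_j, F_ij, m):
    """Normalized Google-Distance of Eq. (5).
    F_i, F_j: occurrence counts of w_i / w_j in the collection;
    F_ij: number of documents containing both; m: number of documents."""
    num = max(math.log2(F_i), math.log2(F_j)) - math.log2(F_ij)
    den = math.log2(m) - min(math.log2(F_i), math.log2(F_j))
    return num / den

# Hypothetical counts over a 1000-document collection.
d_close = ngd(40, 20, 16, 1000)  # w_i, w_j co-occur in 16 documents
d_far   = ngd(40, 20, 1, 1000)   # w_i, w_j co-occur in only 1 document
assert d_close < d_far           # higher co-occurrence => smaller distance
```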

SECTION IV.

Models and Problem Description

A. System Model

The system model adopted in this paper is the same as in [26] and has four entities: the data owner (DO), the data user (DU), the private cloud (Pri-Cloud) and the public cloud (Pub-Cloud). Their cooperation is shown in FIGURE 1.

  1. DO owns the sensitive data. To protect the privacy of its data, DO encrypts the documents and the corresponding vectors and then outsources the encrypted data to Pub-Cloud. DO constructs the DFB-vectors, stored in Pri-Cloud, as an index to speed up searches. Besides, DO has the privilege to grant DU authorization to access the outsourced data.

  2. DU is the user authorized by DO, who is authorized to search the data outsourced in Pub-Cloud. Once DU starts a ranked multi-keyword search, the queried keywords are transformed into a corresponding trapdoor and a QFB-vector which are submitted to Pub-Cloud and Pri-Cloud respectively for processing ranked searches. After DU receives the search result from Pub-Cloud, it decrypts the encrypted data to get the plaintext result.

  3. Pri-Cloud is in charge of storing the index, i.e., the DFB-vectors of the documents. Once receiving the QFB-vector of a query from DU, it performs bitwise AND operations between the QFB-vector and the DFB-vectors to filter out the candidate document IDs for the query, and then transmits the IDs to Pub-Cloud for further search processing.

  4. Pub-Cloud is in charge of storing the data outsourced by DO. Once receiving the trapdoor and the candidate document IDs from DU and Pri-Cloud respectively, it performs the ranked search on the encrypted documents whose IDs are among the candidates and then returns the resulting encrypted documents to DU.

FIGURE 1. System model.

B. Search Model

Given a set of $t$ queried keywords $Q=\{ w_{1}, w_{2},\ldots, w_{t}\}$ , a multi-keyword ranked search is to retrieve the $k$ ranked documents that are most relevant to $Q$ . Formally, we define a multi-keyword search as $Query=(D, Q, k)$ where $k$ is the number of requested documents and $k \ll |D|$ generally. For simplicity, we use the notation $Q$ to represent the search. The result of the query $Q$ , denoted as $R$ , satisfies the following two conditions.

  1. $|R| = k \wedge \forall d_{i},d_{j}(d_{i} \in R \wedge d_{j} \in (D-R)) \rightarrow score(V_{d_{i}},V_{Q}) \geq score(V_{d_{j}},V_{Q})$ .

  2. The documents in $R$ are ranked according to the relevance scores between them and $Q$ .
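Taken together, the two conditions say that $R$ is the ranked top-$k$ list under the relevance score. A minimal sketch with hypothetical scores:

```python
# Hypothetical relevance scores score(V_d, V_Q) for four documents.
scores = {"d1": 0.71, "d2": 0.40, "d3": 0.87, "d4": 0.12}
k = 2

# R: the k highest-scoring documents, ranked by descending score.
R = sorted(scores, key=scores.get, reverse=True)[:k]
# Every document in R scores at least as high as any document outside R.
assert min(scores[d] for d in R) >= max(scores[d] for d in scores if d not in R)
```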

C. Problem Description

In our scheme, we consider the same threat model as [26], which assumes that DO, DU and Pri-Cloud are trusted, while Pub-Cloud is considered "honest-but-curious". That is, Pub-Cloud always executes the pre-deployed algorithms honestly and returns results correctly, but it is curious to peep at the plaintext of the outsourced data, which could cause privacy leakage through data analysis and deduction. We assume that Pub-Cloud has the encrypted data outsourced by DO but does not have the secret keys. In accordance with the background knowledge of Pub-Cloud, the following two threat models are adopted, as in many related works [5], [6], [11]–​[13], [15], [17], [25].

  • Known Encryption Model. In this model, Pub-Cloud only knows the encrypted documents $\widetilde {D}$ and the trapdoor $\widetilde {V}_{Q} $ but it does not have any plaintext information about them. It means that Pub-Cloud has to perform ciphertext-only attack (COA) [28] to observe the plaintext data.

  • Known Background Model. In this model, Pub-Cloud is assumed to have more knowledge than in the Known Encryption Model, such as the keyword frequency statistics of the document collection. The statistical information reveals the number of documents containing specific keywords in $D$ , which could be used by Pub-Cloud to apply TF statistical attacks and hence infer or even recognize certain keywords by analyzing the histogram or value range of the corresponding frequency distributions [11], [12], [29].

In this paper, we focus on the multi-keyword ranked search scheme over encrypted data in hybrid clouds. The design goals are as follows.

1) Multi-Keywords Ranked Search

The proposed scheme is designed so that Pub-Cloud, in cooperation with Pri-Cloud, can determine the ranked $k$ encrypted documents with the $k$ highest relevance scores to the searched keywords.

2) Search Efficiency

The proposed scheme is able to perform efficient multi-keyword ranked searches by using a special index constructed on the basis of the given keyword partition vector model and the complete binary tree structure. The index can filter out candidate documents and prune a large number of irrelevant documents.

3) Privacy-Preserving

The proposed scheme is able to preserve privacy against the curious Pub-Cloud. Particularly, the plaintext of the documents, the index and the queried keywords should be kept private, and the trapdoor unlinkability [5], [6], [11]–​[13] should be protected.

SECTION V.

Keyword Partition Vector Model

To describe the keyword partition vector model (KPVM), the clustering based keyword partition algorithm is first introduced in this section. Then the keyword partition based bit vectors are defined formally, which are the index of the proposed scheme.

A. Clustering Based Keyword Partition

We design the algorithm $GenPartitions$ to partition the keyword dictionary $W$ , which is based on bisecting $k$-means clustering [30] and is shown in Algorithm 1. The Normalized Google-Distance [3] is adopted to measure the distance between keywords. A partition list, denoted as $PL=\{P_{1},P_{2},\ldots, P_{\tau }\}$ , is the output of this algorithm, where $\tau $ is a threshold that controls the number of partitions.

Algorithm 1 $GenPartitions(W,\tau)$

Input: The keyword dictionary $W$ and the partition threshold $\tau $ .

Output: The keyword partition list $PL$ .

1: Initialize $PL=\emptyset $ ;

2: Add $W$ to $PL$ where $W$ is treated as a keyword partition;

3: while $|PL| < \tau $ do

4:  $P_{max} = max(PL)$ ;

5:  Apply the bisecting $k$-means clustering algorithm to the partition $P_{max}$ using the Normalized Google-Distance between keywords, and then append the two generated keyword clusters to $PL$ as two partitions;

6: end while

7: return $PL$

In Algorithm 1, max($PL$) is the function that returns the biggest partition of $PL$ , i.e., the one with the most keywords. In each round of bisecting $k$-means clustering, the biggest partition of $PL$ is divided into two smaller partitions. Therefore, Algorithm 1 is a balanced keyword partition algorithm which tends to balance the number of keywords across the generated partitions even under different default parameter settings of the clustering algorithm. Meanwhile, keywords with high relevance are clustered into the same partitions owing to the adoption of bisecting $k$-means clustering and the Normalized Google-Distance.
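The control flow of Algorithm 1 can be sketched compactly. The split below is a simplified 2-medoid-style assignment standing in for a full bisecting $k$-means pass, and the integer "keywords" with $|i-j|$ as distance are a hypothetical stand-in for the Normalized Google-Distance:

```python
import random

def bisect_split(partition, dist):
    """Simplified stand-in for one bisecting k-means round: pick two
    seed keywords and assign every keyword to the nearer seed."""
    s0, s1 = random.sample(partition, 2)
    left, right = [], []
    for w in partition:
        (left if dist(w, s0) <= dist(w, s1) else right).append(w)
    return left, right

def gen_partitions(W, tau, dist):
    """Algorithm 1: repeatedly split the biggest partition of PL
    until tau partitions remain."""
    PL = [list(W)]
    while len(PL) < tau:
        p_max = max(PL, key=len)   # max(PL): the biggest partition
        PL.remove(p_max)
        PL.extend(bisect_split(p_max, dist))
    return PL

# Toy run: integer "keywords" 0..7, distance |i - j|, tau = 4.
PL = gen_partitions(range(8), 4, lambda a, b: abs(a - b))
```

Whatever seeds are drawn, the output satisfies Observation 1: the partitions are disjoint and their union is the dictionary.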

In addition, there are two reasons for selecting the bisecting $k$-means algorithm. First, it is well suited to clustering document-derived data, compared with other clustering algorithms such as DBSCAN [31]. Second, it generates balanced clusters.

Observation 1:

According to Algorithm 1, we have the following two properties about the generated keyword partitions.

  1. $\bigcup _{P_{i} \in PL} P_{i} =W$

  2. $\forall P_{i},P_{j} \in PL \wedge i \neq j \rightarrow P_{i} \cap P_{j} = \emptyset $

Observation 1 can be deduced from the procedures of Algorithm 1. This observation indicates that the generated partitions are the divisions of the keyword dictionary and there are no intersections between any two partitions.

Definition 1 (Involved Partitions):

Given a document $d_{i} \in D$ , the involved partitions of $d_{i}$ are the partitions that contain at least one keyword of $d_{i}$ . We denote the set of involved partitions of $d_{i}$ as $I\!P\!S(d_{i})$ ; then we have \begin{equation*} I\!P\!S(d_{i})=\{P_{j}|P_{j}\cap d_{i} \neq \emptyset \wedge P_{j} \in PL \}\tag{6}\end{equation*}

Definition 2 (Covered Documents):

Given a keyword partition $P_{i} \in PL$ , the covered documents of $P_{i}$ are the documents that contain at least one keyword of $P_{i}$ . We denote the set of covered documents of $P_{i}$ as $C\!D\!S(P_{i})$ ; then we have \begin{equation*} C\!D\!S(P_{i})=\{d_{j}|d_{j}\cap P_{i} \neq \emptyset \wedge d_{j} \in D \}\tag{7}\end{equation*}

According to Definitions 1 and 2, we can deduce Observation 2 as follows.

Observation 2:

Given a document $d_{i}$ and a keyword partition $P_{j}$ , if $d_{i}$ is in the covered documents of $P_{j}$ , then $P_{j}$ is in the involved partitions of $d_{i}$ , and vice versa.\begin{equation*} d_{i} \in C\!D\!S(P_{j}) \leftrightarrow P_{j} \in I\!P\!S(d_{i})\tag{8}\end{equation*}

Definition 3 (Target Partitions):

Given a query $Q$ with multiple keywords, the target partitions of $Q$ are the keyword partitions that contain at least one queried keyword of $Q$ . We denote the set of target partitions of $Q$ as $T\!P\!S(Q)$ ; then we have \begin{equation*} T\!P\!S(Q)=\{P_{i}| Q \cap P_{i} \neq \emptyset \wedge P_{i} \in PL \}\tag{9}\end{equation*}

Definition 4 (Candidate Documents):

Given a query $Q$ , the candidate documents of $Q$ are the covered documents of the target partitions of $Q$ . We denote the candidate documents of $Q$ as $C\!Docs(Q)$ ; then we have \begin{equation*} C\!Docs(Q)=\bigcup _{P_{i}\in T\!P\!S(Q)} C\!D\!S(P_{i})\tag{10}\end{equation*}

We give an example to illustrate the above algorithm and definitions.

Example 1:

FIGURE 2 illustrates an example of keyword partitions generated by $GenPartitions$ . As shown in the figure, we assume that the document collection $D$ has 10 documents and the keyword partition threshold $\tau $ is set to 5. The list of keyword partitions is $PL=\{P_{1}, P_{2}, P_{3}, P_{4}, P_{5}\}$ and the detailed descriptions are shown in the figure. According to Definition 1, the involved partitions are $I\!P\!S(d_{1})=\{P_{1}, P_{2}, P_{3}\}$ , $I\!P\!S(d_{2})=\{P_{2}, P_{3}, P_{4}\}$ , $I\!P\!S(d_{3})=\{P_{3}, P_{4}, P_{5}\}$ , etc. According to Definition 2, the covered documents are $C\!D\!S(P_{1})=\{d_{1}, d_{4}, d_{9}\}$ , $C\!D\!S(P_{2})=\{d_{1}, d_{2}, d_{4}, d_{9}\}$ , etc. Suppose the query is $Q =\{w_{1}, w_{4}, w_{7}, w_{16}\}$ ; according to Definitions 3 and 4, the target partitions are $T\!P\!S(Q)=\{P_{1}, P_{2}\}$ and the corresponding candidate documents are $C\!Docs(Q)=C\!D\!S(P_{1})\cup C\!D\!S(P_{2})=\{d_{1}, d_{2}, d_{4}, d_{9}\}$ . Obviously, the query result must be in $C\!Docs(Q)$ and the other documents can be ruled out directly.

FIGURE 2. Example of $GenPartitions$ .

B. Keyword Partition Based Bit Vectors

Definition 5 (Document Filtering Bit Vector (DFB-Vector)):

Given a document $d_{i} \in D$ , the DFB-vector of $d_{i}$ is a $\tau $ -dimensional bit vector denoted as $V\!F_{d_{i}}$ . If a keyword of $d_{i}$ belongs to the partition $P_{j} \in PL$ , then $V\!F_{d_{i}} [j]=1$ ; otherwise $V\!F_{d_{i}}[j]=0$ , i.e.\begin{equation*} V\!F_{d_{i}}[j]= \begin{cases} 1,& \exists w_{P} \in d_{i}(w_{P} \in P_{j})\\ 0,& Else \end{cases}j \in \{1,2,\ldots,\tau \}\tag{11}\end{equation*}

Definition 6 (Query Filtering Bit Vector (QFB-Vector)):

Given a query $Q$ with multiple keywords, the QFB-vector of $Q$ is a $\tau $ -dimensional bit vector denoted as $V\!F_{Q}$ . If a keyword of $Q$ belongs to the partition $P_{i} \in PL$ , then $V\!F_{Q}[i]=1$ ; otherwise $V\!F_{Q}[i]=0$ , i.e.\begin{equation*} V\!F_{Q}[i]= \begin{cases} 1,& \exists w_{P} \in Q(w_{P} \in P_{i})\\ 0,& Else \end{cases}i \in \{1,2,\ldots,\tau \}\tag{12}\end{equation*}

According to Definitions 5 and 6, the DFB-vector indicates the involved partitions of the corresponding document while the QFB-vector indicates the target partitions of a query. For example, if $V\!F_{d_{i}}[j]=1$ , then $P_{j}$ is an involved partition of $d_{i}$ , which means that $d_{i}$ is a covered document of $P_{j}$ . If additionally $V\!F_{Q}[j]=1$ , then $P_{j}$ is a target partition of the query $Q$ and $d_{i}$ is a candidate document of $Q$ . Therefore, we can deduce Observation 3 as follows.

Observation 3:

Given a query $Q$ , we have \begin{equation*} C\!Docs(Q)=\{d_{i} | V\!F_{d_{i}}\ \& \ V\!F_{Q} \neq \{0\}^\tau \wedge d_{i} \in D\}\tag{13}\end{equation*} where "&" is the bitwise AND operator and $\{0\}^\tau $ represents a $\tau $ -dimensional zero bit vector.

Observation 3 indicates that, given a document $d_{i}$ , if the bitwise AND operation result between the DFB-vector of $d_{i}$ and the QFB-vector of $Q$ is not a zero bit vector, then $d_{i}$ is a candidate document of $Q$ . Therefore, the DFB-vectors and QFB-vector are the index for filtering out the candidate documents and speeding up the searches.

Consider again Example 1 with its ten documents and the query $Q=\{w_{1}, w_{4}, w_{7}, w_{16}\}$ . According to Definitions 5 and 6, the DFB-vectors of the documents and the QFB-vector of $Q$ are shown in FIGURE 3, which are all 5-dimensional vectors. According to Observation 3, we apply the bitwise AND operation between the DFB-vectors and the QFB-vector; then we have $C\!Docs(Q)=C\!D\!S(P_{1})\cup C\!D\!S(P_{2})=\{d_{1}, d_{2}, d_{4}, d_{9}\}$ , which coincides with the result of Definition 4.
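The filtering of Observation 3 maps directly onto machine words. A sketch restricted to three of the documents of Example 1, with DFB-vectors packed into 5-bit integers (an assumed encoding in which the leftmost bit stands for $P_{1}$):

```python
# Hypothetical DFB-vectors for d1..d3 of Example 1 (leftmost bit = P1).
VF = {
    "d1": 0b11100,  # IPS(d1) = {P1, P2, P3}
    "d2": 0b01110,  # IPS(d2) = {P2, P3, P4}
    "d3": 0b00111,  # IPS(d3) = {P3, P4, P5}
}
VF_Q = 0b11000      # QFB-vector: TPS(Q) = {P1, P2}

# Observation 3: d_i is a candidate iff the bitwise AND is non-zero.
CID = [doc for doc, vf in VF.items() if vf & VF_Q != 0]
```

Here d3 is pruned (its AND result is the zero vector), matching Example 1, where d3 is not covered by $P_{1}$ or $P_{2}$.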

FIGURE 3. $V\!F_{D}$ and $V\!F_{Q}$ .

SECTION VI.

MRSE-HC Scheme

Based on the keyword partition vector model, we give the detailed procedures of our scheme. We first give the framework of MRSE-HC, shown in FIGURE 4. It consists of two stages, the setup stage and the search stage. Before providing ranked search services, DO performs the algorithms $GenKey$ , $GenPartitions$ , $GenVectors$ , $EncData$ and $StoreData$ to set up the system. Once a query is issued, the search stage is performed by the algorithms $GenTrapdoor$ , $GenQFBVector$ , $Filtering$ and $Searching$ . Detailed statements about these algorithms are given in the following two sub-sections.

FIGURE 4. The framework of MRSE-HC.

A. Algorithms in Setup Stage

1) $SK \leftarrow GenKey(1^{l(n)})$

DO generates the secret key $S\!K=\{S, M_{1}, M_{2}, g\}$ where $S$ is a randomly generated $n$ -dimensional bit vector, $M_{1}$ and $M_{2}$ are random $n\times n$ invertible matrices, and $g$ is the key for document encryption. $S\!K$ is shared by DO and DU.

2) $PL \leftarrow GenPartitions(W, \tau)$

This algorithm is given in detail in Algorithm 1. It is based on the bisecting $k$-means clustering and the Normalized Google-Distance measurement. After the algorithm is performed, keywords with high relevance in the keyword dictionary are clustered into partitions and the partition list $P\!L=\{P_{1}, P_{2}, \ldots, P_\tau \}$ is generated, where $\tau $ is a threshold that controls the number of output partitions.

3) $\{V_{D}, VF_{D}\} \leftarrow GenVectors(D, PL)$

For each $d_{i} \!\in \!D$ , DO generates the corresponding document vector $V_{d_{i}}$ according to Eq. (1), and then generates the corresponding DFB-vector $V\!F_{d_{i}}$ according to Definition 5. The sets of generated document vectors and DFB-vectors are $V_{D}=\{V_{d_{1}},V_{d_{2}},\ldots,V_{d_{m}}\}$ and $V\!F_{D}=\{V\!F_{d_{1}},V\!F_{d_{2}},\ldots,V\!F_{d_{m}}\}$ respectively. The generated DFB-vectors will be utilized as the index in our scheme to filter out the candidate documents and speed up the ranked searches.

4) $\{\tilde{D}, \tilde{V}_{D} \} \leftarrow EncData(D, V_{D}, SK)$

For each $d_{i} \in D$ and the corresponding document vector $V_{d_{i}}\in V_{D} $ , DO first encrypts $d_{i}$ into $\widetilde {d}_{i} $ by a symmetric encryption algorithm (such as DES or AES) with the secret key $g$ in $S\!K$ . Second, DO generates two random ${n}$ -dimensional vectors $\{V_{d_{i}}^{1},V_{d_{i}}^{2}\}$ according to the random bit vector $S$ in $S\!K$ . Specifically, if $S[j]=0$ , then $V_{d_{i}}^{1} [j]=V_{d_{i}}^{2}[j]=V_{d_{i}}[j]$ ; otherwise $V_{d_{i}}^{1}[j]=GenRand()$ and $V_{d_{i}}^{2}[j]=V_{d_{i}}[j]-V_{d_{i}}^{1}[j]$ , where $GenRand()$ is a random value generator. Then, the encrypted document vector is calculated as $\widetilde {V}_{d_{i}}=\{V_{d_{i}}^{1}\!M_{1}^{T},V_{d_{i}}^{2}\!M_{2}^{T}\} $ . Through the above operations, the encrypted documents $\widetilde {D}=\{\widetilde {d}_{1},\widetilde {d}_{2},\ldots,\widetilde {d}_{m}\}$ and the corresponding encrypted document vectors $\widetilde {V}_{D}=\{\widetilde {V}_{d_{1}},\widetilde {V}_{d_{2}},\ldots,\widetilde {V}_{d_{m}}\} $ are generated.
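The vector-encryption step of $EncData$ can be sketched as follows (a sketch only: the symmetric document encryption is omitted and the function name is assumed):

```python
import numpy as np

def enc_doc_vector(Vd, S, M1, M2, rng):
    """Split Vd by the bit vector S (copy where S[j]==0, randomize where
    S[j]==1 so the two shares sum to Vd[j]), then multiply each share by
    the corresponding transposed matrix."""
    Vd1 = Vd.astype(float).copy()
    Vd2 = Vd.astype(float).copy()
    for j in range(len(Vd)):
        if S[j] == 1:
            Vd1[j] = rng.standard_normal()      # GenRand()
            Vd2[j] = Vd[j] - Vd1[j]
    return Vd1 @ M1.T, Vd2 @ M2.T
```

With identity matrices the split itself is visible: positions with $S[j]=0$ hold two equal copies of $V_{d_{i}}[j]$ , while positions with $S[j]=1$ hold two random shares that sum to $V_{d_{i}}[j]$ .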

5) $StoreData(VF_{D}, \tilde{D}, \tilde{V}_{D})$

After the above steps, DO outsources the encrypted documents $\widetilde {D}$ and the corresponding encrypted document vectors $\widetilde {V}_{D}$ to Pub-Cloud, and uploads the DFB-vectors $V\!F_{D}$ to Pri-Cloud. Note that the corresponding document IDs are stored in both Pub-Cloud and Pri-Cloud; the only data shared between the two clouds are these document IDs.

After the five steps, the setup stage is finished and the system is prepared for the multi-keyword search over encrypted documents.

B. Algorithms in Search Stage

1) $\tilde{V}_{Q} \leftarrow GenTrapdoor(Q, SK)$

Once a query $Q$ with multi-keywords is applied, DU generates the query vector $V_{Q}$ according to Eq. (2) and two random $n$ -dimensional vectors $\{V_{Q}^{1}, V_{Q}^{2}\}$ according to the random bit vector $S$ in $SK$ . Specifically, if $S[i]=1$ , then $V_{Q}^{1} [i]=V_{Q}^{2}[i]=V_{Q}[i]$ ; otherwise $V_{Q}^{1}[i]=GenRand()$ and $V_{Q}^{2}[i]=V_{Q}[i]-V_{Q}^{1} [i]$ . Then, the encrypted query vector $\widetilde {V}_{Q}$ is calculated, $\widetilde {V}_{Q}=\{V_{Q}^{1}\!M_{1}^{-1},V_{Q}^{2}\!M_{2}^{-1}\}$ , which is the trapdoor of $Q$ and submitted to Pub-Cloud.
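A sketch of $GenTrapdoor$ (function name assumed), mirroring the document-side split with the complementary rule:

```python
import numpy as np

def gen_trapdoor(Vq, S, M1, M2, rng):
    """Split the query vector with the rule complementary to EncData
    (copy where S[i]==1, randomize where S[i]==0), then multiply the
    shares by the matrix inverses."""
    Vq1 = Vq.astype(float).copy()
    Vq2 = Vq.astype(float).copy()
    for i in range(len(Vq)):
        if S[i] == 0:
            Vq1[i] = rng.standard_normal()      # GenRand()
            Vq2[i] = Vq[i] - Vq1[i]
    return Vq1 @ np.linalg.inv(M1), Vq2 @ np.linalg.inv(M2)
```

The complementary split is what makes the cross terms cancel in Lemma 1: wherever one side holds two random shares, the other side holds two equal copies.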

2) $VF_{Q} \leftarrow GenQFBVector(Q, PL)$

According to Definition 6, DU generates the QFB-vector $V\!F_{Q}$ of $Q$ , which indicates the target partitions of $Q$ , and then transmits $V\!F_{Q}$ to Pri-Cloud.

3) $CID \leftarrow Filtering(VF_{Q}, VF_{D})$

Pri-Cloud utilizes the DFB-vectors as the index to filter out the candidate documents for the query. It performs the bitwise AND operation between each DFB-vector and the received QFB-vector and then finds the corresponding IDs of the candidate documents for the query $Q$ according to Observation 3. Specifically, for each $V\!F_{d_{i}} \in V\!F_{D}$ , if $V\!F_{d_{i}}\& V\!F_{Q}$ is not a zero-vector, then $Identity(d_{i})$ , the ID of $d_{i}$ , is added to the ID set $C\!I\!D$ . After processing all the DFB-vectors of $V\!F_{D}$ , Pri-Cloud transmits only $C\!I\!D$ to Pub-Cloud; $C\!I\!D$ is the only data shared between Pri-Cloud and Pub-Cloud. The complexity of this algorithm is $O(m\ast \tau)$ since there are $m\ast \tau $ bitwise AND operations.
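The filtering rule is one line when bit vectors are represented as Python ints (a sketch; the function name is assumed):

```python
def filtering(vf_docs, vf_q):
    """Sketch of Filtering on Pri-Cloud: vf_docs maps document IDs to their
    DFB-vectors, vf_q is the QFB-vector. A document is a candidate iff the
    bitwise AND of its DFB-vector and the QFB-vector is nonzero."""
    return [doc_id for doc_id, vf in vf_docs.items() if vf & vf_q != 0]
```

For example, with $\tau = 4$ partitions, `filtering({1: 0b0101, 2: 0b1000, 3: 0b0010}, 0b0001)` keeps only document 1, since only its DFB-vector shares a set bit with the QFB-vector.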

4) $\Re \leftarrow Searching(\tilde{D}, \tilde{V}_{D}, \tilde{V}_{Q}, CID, k)$

Pub-Cloud computes the inner products between the trapdoor $\widetilde {V}_{Q}$ and the encrypted document vectors $\{\widetilde {V}_{d_{i}}| \widetilde {V}_{d_{i}} \in \widetilde {V}_{D} \wedge Identity(d_{i}) \in C\!I\!D \}$ . According to Lemma 1, these inner products equal the inner products between the corresponding plaintext query vector and document vectors, which represent the relevance scores between the queried keywords and the documents according to the relevance score measurement in the Preliminaries section. The $k$ encrypted documents with the highest inner products form the result set $\Re $ , which is returned to DU. The complexity of this algorithm is $O(|C\!I\!D|*n + k*|C\!I\!D|)\approx O(|C\!I\!D|*n)$ since $k \ll n$ generally holds. $|X|$ denotes the number of items in the set $X$ .

Finally, when DU receives the result encrypted documents $\Re $ from Pub-Cloud, it uses the shared secret key to decrypt them and obtain the plaintext result documents.

Lemma 1:

$\widetilde {V}_{Q} \cdot \widetilde {V}_{d_{i}} =V_{Q} \cdot V_{d_{i}} $

Proof:

\begin{align*} \widetilde {V}_{Q} \cdot \widetilde {V}_{d_{i}}=&\{V_{Q}^{1}\!M_{1}^{-1},V_{Q}^{2}\!M_{2}^{-1}\}\cdot \{V_{d_{i}}^{1}\!M_{1}^{T},V_{d_{i}}^{2}\!M_{2}^{T}\}^{T} \\=&(V_{Q}^{1}\!M_{1}^{-1})\cdot (V_{d_{i}}^{1}\!M_{1}^{T})^{T} + (V_{Q}^{2}\!M_{2}^{-1})\cdot (V_{d_{i}}^{2}\!M_{2}^{T})^{T}\\=&V_{Q}^{1}\!M_{1}^{-1}\!M_{1}(V_{d_{i}}^{1})^{T} +V_{Q}^{2}\!M_{2}^{-1}\!M_{2}(V_{d_{i}}^{2})^{T}\\=&V_{Q}^{1}(V_{d_{i}}^{1})^{T}+V_{Q}^{2}(V_{d_{i}}^{2})^{T}\\=&V_{Q} \cdot V_{d_{i}}\end{align*}
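A self-contained numeric check of Lemma 1, using random stand-in vectors and matrices and the split rules of $EncData$ and $GenTrapdoor$ :

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
S = rng.integers(0, 2, size=n)
M1, M2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
Vd, Vq = rng.standard_normal(n), rng.standard_normal(n)

Vd1, Vd2 = Vd.copy(), Vd.copy()
Vq1, Vq2 = Vq.copy(), Vq.copy()
for j in range(n):
    if S[j] == 1:                     # document vector split where S[j]==1
        Vd1[j] = rng.standard_normal()
        Vd2[j] = Vd[j] - Vd1[j]
    else:                             # query vector split where S[j]==0
        Vq1[j] = rng.standard_normal()
        Vq2[j] = Vq[j] - Vq1[j]

enc_d = (Vd1 @ M1.T, Vd2 @ M2.T)                            # encrypted doc vector
trap = (Vq1 @ np.linalg.inv(M1), Vq2 @ np.linalg.inv(M2))   # trapdoor

lhs = trap[0] @ enc_d[0] + trap[1] @ enc_d[1]
assert abs(lhs - Vq @ Vd) < 1e-6      # Lemma 1 holds numerically
```

The matrices cancel pairwise ( $M_{1}^{-1}M_{1}$ and $M_{2}^{-1}M_{2}$ ), and the complementary splits make the two share-products sum back to the plaintext inner product.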

In the above algorithms, DU transforms the query request into two vectors: the trapdoor (the encrypted query vector) generated in $GenTrapdoor$ and the QFB-vector generated in $GenQFBVector$ . The latter is submitted to Pri-Cloud, which uses it together with the DFB-vectors of documents to filter out the candidate document IDs in $Filtering$ . According to the received candidate document IDs, Pub-Cloud determines the result encrypted documents and returns them to DU in $Searching$ . Since the search space is shrunk to the candidate documents, the ranked search efficiency is improved. We give the performance evaluation on search efficiency in Section VIII.

SECTION VII.

The Enhanced Scheme

In this section, we propose the enhanced scheme EMRSE-HC, which is designed to improve the efficiency of MRSE-HC. EMRSE-HC utilizes the complete binary pruning tree (CBP-Tree) instead of the sequential DFB-vectors, and the corresponding CBP-Tree based filtering algorithm is adopted. Compared with MRSE-HC, EMRSE-HC adds the CBP-Tree construction algorithm $BuildCBPT$ and updates the $StoreData$ and $Filtering$ algorithms. The three algorithms are briefly introduced as follows.

  • $\mathcal {I}\leftarrow BuildCBPT(D)$ . In the EMRSE-HC scheme, the CBP-Tree, denoted as $\mathcal {I}$ , is constructed as the index. Each node in $\mathcal {I}$ stores the DFB-vector and the pruning vector; the latter is used to improve the search efficiency. Details of this algorithm are given in Section VII.A.

  • $StoreData(\mathcal {I},\widetilde {D},\widetilde {V}_{D})$ . In the EMRSE-HC scheme, $\mathcal {I}$ is stored in Pri-Cloud, and the storage of $\widetilde {D}$ and $\widetilde {V}_{D}$ is the same as in MRSE-HC.

  • $C\!I\!D\leftarrow Filtering(\mathcal {I}, i, V\!F_{Q})$ . In the EMRSE-HC scheme, the $Filtering$ algorithm utilizes the CBP-Tree to filter out the unqualified documents efficiently and generates the candidate documents. Details of the updated $Filtering$ algorithm are shown in Section VII.B.

A. CBP-Tree Construction Algorithm

The CBP-Tree is the index of EMRSE-HC, which is a complete binary tree. Each node corresponds to a document. The data structure of the node is defined as:\begin{equation*} < docI\!D,df\!v,pv>,\tag{14}\end{equation*} where $docID$ and $dfv$ are the identity and DFB-vector of a document respectively, and $pv$ is the pruning vector of the node, which is used for filtering out unqualified nodes.

According to [32], an array is an appropriate structure to store a complete binary tree. Thus, we utilize the array $\mathcal {I}[1,2,\ldots,m]$ to represent the CBP-Tree. The conclusion of [32] shows that for the node $\mathcal {I}[i]$ : if $i=1$ , $\mathcal {I}[i]$ is the root node; if $i>1$ , the parent node of $\mathcal {I}[i]$ is $\mathcal {I}[\lfloor i/2 \rfloor]$ ; and the left and right children of $\mathcal {I}[i]$ , if they exist, are $\mathcal {I}[2i]$ and $\mathcal {I}[{2i+1}]$ respectively. The CBP-Tree construction algorithm is shown in Algorithm 2.
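The index arithmetic from [32] can be stated directly (1-indexed array, as in the text):

```python
# 1-indexed complete-binary-tree layout: root at I[1];
# parent of I[i] is I[i // 2]; children of I[i] are I[2i] and I[2i + 1].
def parent(i): return i // 2
def left(i):   return 2 * i
def right(i):  return 2 * i + 1
```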

Algorithm 2 $BuildC\!B\!P\!T(D)$

Input: The document collection $D$ .

Output: CBP-Tree $\mathcal {I}[1,2,\ldots,m]$ where $\mathcal {I}\![i]$ is a node of the tree.

1: for $i=1,2,\ldots,m$ do
2:  Settle the node $\mathcal {I}[i]$ where $\mathcal {I}\![i].docI\!D=id(d_{i}),\,\, \mathcal {I}\![i].pv=\mathcal {I}\![i].dfv=V\!F_{d_{i}}$ ;
3: end for
4: for $i=\lceil (m-1)/2 \rceil,\ldots,1$ do
5:  if $2i+1 \leq m$ then
6:   $\mathcal {I}\![i].pv=\mathcal {I}[i].pv \, \boldsymbol {|} \, \mathcal {I}\![2i].pv \, \boldsymbol {|} \, \mathcal {I}\![{2i+1}].pv$ ;
7:  else
8:   $\mathcal {I}\![i].pv=\mathcal {I}[i].pv \, \boldsymbol {|} \, \mathcal {I}\![2i].pv$ ;
9:  end if
10: end for
11: return $\mathcal {I}$

In Algorithm 2, if the node $\mathcal {I}\![i]$ is a leaf node, its pruning vector is set equal to the DFB-vector of the node. If $\mathcal {I}\![i]$ is an internal node, its pruning vector is generated by the bitwise OR operation, depending on whether it has a right child node. Since the CBP-Tree is a complete binary tree with $m$ nodes, its height is $\lceil log_{2}(m+1)\rceil $ , which is the shortest among binary trees with $m$ nodes.
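A sketch of Algorithm 2 in Python, with bit vectors as ints, nodes as dicts following the $< docI\!D, df\!v, pv>$ structure, and a padding slot at index 0 for the 1-indexed array:

```python
def build_cbpt(dfb_vectors):
    """Build the CBP-Tree over m documents: dfb_vectors[i-1] is the
    DFB-vector of document i. Every node's pruning vector starts as its
    own DFB-vector; a bottom-up pass ORs each internal node's children
    into it, so pv covers the node's whole subtree."""
    m = len(dfb_vectors)
    I = [None] + [{'docID': i + 1, 'dfv': v, 'pv': v}
                  for i, v in enumerate(dfb_vectors)]
    for i in range(m // 2, 0, -1):        # internal nodes, bottom-up
        I[i]['pv'] |= I[2 * i]['pv']
        if 2 * i + 1 <= m:                # right child exists
            I[i]['pv'] |= I[2 * i + 1]['pv']
    return I
```

Note that `m // 2` equals the paper's $\lceil (m-1)/2 \rceil $ loop bound for $m \geq 2$ , so the bottom-up pass visits exactly the internal nodes.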

We take the same example as Example 1 to illustrate Algorithm 2. The DFB-vectors of the documents in Example 1 are shown in FIGURE 4. Taking those DFB-vectors as the input of Algorithm 2, the constructed CBP-Tree is shown in FIGURE 5.

FIGURE 5. CBP-Tree.

B. CBP-Tree Based Filtering Algorithm

DO stores the CBP-Tree in Pri-Cloud. Pri-Cloud utilizes the pruning vectors of the nodes in the CBP-Tree to prune unqualified subtrees. The CBP-Tree based $Filtering$ algorithm is shown in Algorithm 3.

Algorithm 3 ${Filtering(\mathcal{I}, i, V\!F_{Q}, C\!I\!D)}$

Input: $\mathcal {I}[i]$ is the current processed node, $V\!F_{Q}$ is the QFB-vector.

Output: $C\!I\!D$ is the candidate document ID collection.

1: if $i \leq m$ then
2:  if $2i > m$ then
3:   if $\mathcal {I}[i].dfv \,\& \,V\!F_{Q}\neq 0$ then
4:    Add $\mathcal {I}[i].docID$ into $CID$ ;
5:   end if
6:  else
7:   if $\mathcal {I}[i].pv \, \& \, V\!F_{Q}\neq 0$ then
8:    if $\mathcal {I}[i].dfv \, \& \, V\!F_{Q}\neq 0$ then
9:     Add $\mathcal {I}[i].docID$ into $CID$ ;
10:    end if
11:    $Filtering(\mathcal {I}, 2i, V\!F_{Q}, C\!I\!D)$ ;
12:    $Filtering(\mathcal {I}, 2i+1, V\!F_{Q}, C\!I\!D)$ ;
13:   end if
14:  end if
15: end if

Algorithm 3 is a recursive algorithm that starts from the root node. It finds all candidate nodes, i.e., the nodes whose bitwise AND between DFB-vector and QFB-vector is not equal to 0. The pruning vector of each internal node is utilized to prune unqualified subtrees early, while the DFB-vector of each visited node determines whether it is a candidate. Because the height of the CBP-Tree is $O(log_{2}m)$ and unqualified subtrees are pruned early, the complexity of this algorithm is $O(\tau \ast log_{2}m)$ .
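A sketch of Algorithm 3, assuming an array-based CBP-Tree whose nodes are dicts with the $< docI\!D, df\!v, pv>$ fields and whose bit vectors are Python ints:

```python
def cbpt_filtering(I, i, vf_q, cid):
    """Recursive CBP-Tree filtering: I is a 1-indexed array of m nodes
    (I[0] unused), vf_q is the QFB-vector. Appends candidate document
    IDs to cid, starting the recursion from i = 1 (the root)."""
    m = len(I) - 1
    if i > m:
        return
    if 2 * i > m:                  # leaf: test its DFB-vector directly
        if I[i]['dfv'] & vf_q:
            cid.append(I[i]['docID'])
    elif I[i]['pv'] & vf_q:        # internal: prune subtree if pv misses
        if I[i]['dfv'] & vf_q:
            cid.append(I[i]['docID'])
        cbpt_filtering(I, 2 * i, vf_q, cid)
        cbpt_filtering(I, 2 * i + 1, vf_q, cid)
```

When a subtree's pruning vector shares no set bit with the QFB-vector, the whole subtree is skipped without touching any of its DFB-vectors, which is the source of the speedup over the sequential scan.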

The enhanced $Filtering$ algorithm improves the filtering efficiency and thus the overall search efficiency. We give the detailed performance evaluation in Section VIII.

C. Security Enhancement

In the search stage, different trapdoors are generated when the same query is applied, but the corresponding candidate documents and calculated relevance scores are the same. Pub-Cloud could use these covert channels to link identical search requests and deduce the hot keywords with high frequency in documents. To overcome this problem, a practical and effective countermeasure is to extend the vector dimension by adding some phantom terms into the vectors to break such covert channels [5], [6], [11], [13]. We also introduce this method to enhance the security of our scheme and to protect the document confidentiality, the index and trapdoor privacy, and the trapdoor unlinkability.

We adopt a phantom item addition method similar to [5], [6], [11], [13] to provide trapdoor unlinkability. The brief idea is as follows. First, DO randomly generates an $(n+L+1)$ -dimensional bit vector $S'$ and two $(n+L+1)\times (n+L+1)$ invertible matrices $\{M_{1}',M_{2}'\}$ . Then, the document vectors and query vectors are extended to $(n+L+1)$ dimensions, where $L$ is the number of dummy keywords inserted. The $(n+j)$ -th entry in the extended document vector $V_{d_{i}}'$ is set to a random number $\varepsilon _{j}$ where $j \in \{1,2,\ldots,L\}$ , and the $(n+L+1)$ -th entry is set to 1. Finally, $G$ out of the $L$ dummy keywords are randomly selected and the corresponding entries in the extended query vector $V_{Q}'$ are set to 1; $V_{Q}'$ is then multiplied by a random parameter $r$ , and the $(n+L+1)$ -th entry is set to a random value $\lambda $ . The extended vectors are used for computing the ranked encrypted documents. Due to the adoption of the secure inner product operation, the phantom items are hidden in the encrypted extended vectors and it is hard to distinguish the phantom items from the real items. Note that the added phantom terms slightly decrease the accuracy of the query result, but the trapdoor unlinkability is preserved.

Adding the phantom items to the document vectors and the query vectors affects the accuracy of the relevance scores and the query results, but the trapdoor unlinkability is preserved. In addition, the standard deviation parameter $\sigma $ can be adjusted to balance the accuracy and the trapdoor unlinkability. An analysis of the effects of adding phantom items can be found in [6], [11].

D. Security Analysis

In this section, we analyze the EMRSE-HC scheme according to the three privacy demands.

1) Document Confidentiality

In the scheme, the documents and the corresponding vectors are encrypted and then outsourced to Pub-Cloud. The encryption procedures are given in the $EncData$ algorithm. Documents are encrypted by a symmetric encryption algorithm (such as AES) with the secret key $g$ in $S\!K$ , while the corresponding vectors are encrypted by the secure inner product operation with one random bit vector and two random invertible matrices in $S\!K$ . Since $S\!K$ generated by DO is only shared with the authorized DU, Pub-Cloud has no knowledge of the secret keys in $S\!K$ . Thus, it is computationally infeasible for Pub-Cloud to obtain the plaintext information from the encrypted documents and vectors, and document confidentiality is guaranteed.

2) Index and Trapdoor Privacy

In the scheme, the index is the CBP-Tree stored in Pri-Cloud, which is isolated from Pub-Cloud. Trapdoors are generated by performing the secure inner product operation on the query vectors with the secret key $SK$ , which is only shared by DO and DU. Additionally, several random items are added in the trapdoor generation process, which increases the randomness of the values in the vectors. Without the secret key and the phantom item addition method, it is hard for Pub-Cloud to recover the plaintext query vectors. As a result, the scheme protects the privacy of the index and the trapdoor.

3) Trapdoor Unlinkability

In this scheme, the probability that the $GenTrapdoor$ algorithm generates the same trapdoor for the same query is extremely low and can be considered negligible. We analyze the reason under the known background model. According to the phantom item addition procedure in Section VII.C, the original query vector is extended to $(n+L+1)$ dimensions by adding $L+1$ entries for the phantom items. $r$ is the noise parameter that enhances the trapdoor: each dimension of the trapdoor is multiplied by $r$ , and the length of $r$ is denoted as $l_{r}$ . The former $n$ bits of the extended trapdoor have $2^{n}$ possibilities.

The probability that the same query generates the same trapdoor is $p=\frac {1}{x \times 2^{l_{r}}}$ , where $x$ is the number of possibilities of the former bits; the larger the denominator, the smaller $p$ . In other words, the larger the parameters, the smaller the probability of generating the same trapdoor and the safer the scheme. For example, if $r$ is a 1024-bit parameter, then $p \lt {\frac {1}{2^{1024}}}$ , which is considered negligible. Therefore, trapdoors are unlinkable in this scheme.

SECTION VIII.

Performance Evaluation

In this section, we evaluate the performance of the proposed basic scheme MRSE-HC and the enhanced scheme EMRSE-HC, and compare them with the scheme presented in [26], which is denoted as FMRS. We implement MRSE-HC, EMRSE-HC and FMRS and evaluate the search time cost on the real dataset of NSF Research Award Abstracts provided by UCI [33]. The dataset includes about 129,000 abstracts. We use IK Analyzer [34] to extract the keywords of the documents and then process the extracted keywords.

The experimental hardware environment is an Intel Core(TM) i5-8250 CPU, 4 GB memory, and a 588 GB hard disk; the software environment is the Eclipse development platform. Since the queried keywords are usually relevant to each other in practice, we assume that the default number of target partitions of a query is one. Other default parameters are summarized in Table 1, where $m$ , $n$ , $\tau $ , $t$ and $k$ are the numbers of documents, keywords in the dictionary, clustered partitions, queried keywords and requested documents respectively.

TABLE 1. Default Values of Parameters.

In the following experiments, we evaluate the time cost of searches where one of the above parameters changes and the other parameters adopt the default values. The results are shown in FIGURES 6–10.

FIGURE 6. Search time cost vs $\tau $ .

FIGURE 7. Search time cost vs $k$ .

FIGURE 8. Search time cost vs $t$ .

FIGURE 9. Search time cost vs $m$ .

FIGURE 10. Search time cost vs $n$ .

FIGURES 6–10 all show that the proposed MRSE-HC and EMRSE-HC both outperform FMRS in the time cost of ranked searches. Among them, EMRSE-HC is the most efficient; on average, it saves about 20% and 80% of the time cost compared with MRSE-HC and FMRS respectively. The reasons are: (1) The target partitions of MRSE-HC and EMRSE-HC are usually fewer than those of FMRS when a query is applied, because the partitions generated in MRSE-HC or EMRSE-HC are clustered according to the relevance between keywords and the queried keywords are usually relevant to each other. The fewer the target partitions, the fewer the candidate documents determined, which improves the search efficiency. (2) In EMRSE-HC, the tree index, the CBP-Tree, is adopted and a large number of irrelevant documents are filtered out. Thus, the search efficiency is improved.

We analyze the impacts of the changes of $\tau $ , $k$ , $t$ , $m$ and $n$ on the search time cost of MRSE-HC, EMRSE-HC and FMRS as follows.

  1. FIGURE 6 indicates that, as $\tau $ grows, the time cost of MRSE-HC, EMRSE-HC and FMRS all decrease. The reason is that, when $\tau $ grows, more partitions are generated, so the keywords in each partition and the documents covered by each partition both decrease. Since the number of target partitions stays constant, the total covered documents of the target partitions, which are the candidate documents, decrease simultaneously. Thus, the time cost of all three schemes decreases.

  2. FIGURE 7 shows that the increase of the number of requested documents $k$ has little impact on the time cost of MRSE-HC, EMRSE-HC and FMRS. The reason is that the search conditions (such as partitions, vectors and documents) remain unchanged as $k$ grows, so the candidate documents in the three schemes remain the same and their time costs just oscillate randomly around a certain level.

  3. FIGURE 8 indicates that as the number of queried keywords $t$ grows, the time cost of FMRS increases while the time cost of MRSE-HC and EMRSE-HC only oscillates randomly. The reason is that keywords in FMRS are equally divided and randomly distributed among partitions, while keywords in MRSE-HC and EMRSE-HC are clustered into partitions by high relevance. When $t$ grows, the number of target partitions increases proportionally in FMRS but changes only slightly in MRSE-HC and EMRSE-HC. Since the candidate documents are proportional to the target partitions, the time cost of FMRS increases while the time cost of MRSE-HC and EMRSE-HC changes only slightly.

  4. FIGURE 9 indicates that the time cost of MRSE-HC, EMRSE-HC and FMRS all increase as $m$ grows. The reason is that, when the number of documents $m$ increases, the average number of documents covered by each partition grows simultaneously. Since the target partitions of the schemes do not change, the total covered documents of the target partitions increase. Thus the candidate documents increase, which consumes more time in the ranked searches.

  5. FIGURE 10 shows that, as $n$ grows, the time cost of MRSE-HC, EMRSE-HC and FMRS all increase. The reason is that the average number of keywords in each partition grows as the dictionary capacity $n$ increases, which causes the documents covered by each partition to increase. Since the target partitions of the schemes do not change, the candidate documents (the documents covered by the target partitions) increase simultaneously. In addition, the dimensions of the document vectors and query vectors are both enlarged when $n$ grows, which consumes more time in the relevance score calculations. Thus, all schemes consume more time to accomplish the ranked searches.

  6. FIGURE 11 shows that the space cost of the indexes in MRSE-HC, EMRSE-HC and FMRS all increase as $\tau $ grows. The reason is that, when $\tau $ grows, more partitions are generated and the dimensions of the document filtering bit vectors increase simultaneously. The space costs of the indexes in MRSE-HC and FMRS are the same, because the keywords in MRSE-HC and FMRS are the same and thus the vectors in their indexes are also the same. In addition, the space cost of the index in EMRSE-HC is about twice that in MRSE-HC, because the index in EMRSE-HC is a complete binary tree whose total number of nodes is nearly twice the number of leaf nodes; the leaf nodes store the same vectors as MRSE-HC and the internal nodes store the same structure of vectors as the leaf nodes.

FIGURE 11. Space cost of index vs $\tau $ .

SECTION IX.

Conclusion

It is still a challenge to ensure the efficiency of searches while ensuring the accuracy of the search results, and most of the existing multi-keyword ranked search schemes over encrypted data are designed for the public cloud. In this paper, we propose a privacy-preserving multi-keyword ranked search scheme over encrypted data in hybrid clouds, which is denoted as MRSE-HC. In our scheme, the keyword dictionary of documents is clustered into balanced partitions by a keyword partition algorithm based on bisecting $k$ -means clustering. Based on the generated keyword partitions, the DFB-vector indicating the involved partitions of each document and the QFB-vector indicating the target partitions of a query are constructed. Pri-Cloud then uses the QFB-vector and the DFB-vectors to find the candidate document IDs, which efficiently cuts out most irrelevant documents. Finally, Pub-Cloud uses those IDs and the query trapdoor to determine the result encrypted documents and returns them to users. Besides, we utilize the secure inner product algorithm to defend against two threat models. The experimental results show that the scheme proposed in this paper has better performance in terms of efficiency compared with the existing methods.

ACKNOWLEDGMENT

This article is an extended version of the preliminary work [1].
