
© 2014 IEEE

SECTION I

Mobile cloud computing [1] [2] [3] [4] removes the hardware limitations of mobile devices by exploiting scalable, virtualized cloud storage and computing resources, and can therefore provide much more powerful and scalable mobile services to users. In mobile cloud computing, mobile users typically outsource their data to external cloud servers, e.g., iCloud, to enjoy stable, low-cost and scalable data storage and access. However, outsourced data typically contain sensitive private information, such as personal photos and emails, and without effective protection this would lead to severe confidentiality and privacy violations [5]. It is therefore necessary to encrypt sensitive data before outsourcing them to the cloud. Data encryption, however, creates significant difficulties when other users need to find the data they are interested in, because searching over encrypted data is hard. This fundamental issue in mobile cloud computing has motivated an extensive body of research in recent years on searchable encryption techniques that achieve efficient search over outsourced encrypted data [6] [7] [8] [9].

A number of research works have recently been developed on the topic of multi-keyword search over encrypted data. Cash et al. [10] propose a symmetric searchable encryption scheme which achieves high efficiency for large databases with a modest sacrifice in security guarantees. Cao et al. [11] propose a multi-keyword search scheme supporting result ranking by adopting the $k$-nearest neighbors (kNN) technique [12]. Naveed et al. [13] propose a dynamic searchable encryption scheme through blind storage to conceal the access pattern of the search user.

To meet practical search requirements, search over encrypted data should support the following three functions. First, searchable encryption schemes should support multi-keyword search and provide a user experience similar to searching in Google with several keywords; single-keyword search is far from satisfactory because it returns only very limited and inaccurate search results. Second, to quickly identify the most relevant results, the search user would typically prefer cloud servers to sort the returned search results in a relevance-based order [14], ranked by the relevance of the search request to the documents. In addition, returning ranked results can eliminate unnecessary network traffic, since only the most relevant results are sent back from the cloud to search users. Third, as for search efficiency, since the number of documents contained in a database could be extraordinarily large, searchable encryption schemes should respond to search requests with minimum delay.

Despite these theoretical benefits, most existing proposals fail to offer sufficient insight into the construction of fully functional searchable encryption as described above. To address this issue, in this paper we propose an efficient multi-keyword ranked search (EMRS) scheme over encrypted mobile cloud data through blind storage. Our main contributions can be summarized as follows:

- We introduce a relevance score in searchable encryption to achieve multi-keyword ranked search over the encrypted mobile cloud data. In addition, we construct an efficient index to improve the search efficiency.
- By modifying the blind storage system in the EMRS, we solve the trapdoor unlinkability problem and conceal the access pattern of the search user from the cloud server.
- We give a thorough security analysis to demonstrate that the EMRS can reach a high security level, including confidentiality of documents and index, trapdoor privacy, trapdoor unlinkability, and concealing the access pattern of the search user. Moreover, we conduct extensive experiments, which show that the EMRS achieves better functionality and search efficiency than existing proposals.

The remainder of this paper is organized as follows. In Section II, the system model, security requirements and design goals are formalized. In Section III, we recap relevance scoring, the secure kNN technique, the blind storage system and ciphertext policy attribute-based encryption. In Section IV, we propose the EMRS. Its security analysis and performance evaluation are presented in Section V and Section VI, respectively. In Section VII, we present related work. Finally, we conclude this paper in Section VIII.

SECTION II

As shown in Fig. 1, the system model in the EMRS consists of three entities: data owner, search users and cloud server. The data owner keeps a large collection of documents $D$ to be outsourced to a cloud server in an encrypted form $C$. In the system, the data owner sets a keyword dictionary $W$ which contains $d$ keywords. To enable search users to query over the encrypted documents, the data owner builds the encrypted index $\digamma$. Both the encrypted documents $C$ and encrypted index $\digamma$ are stored on the cloud server through blind storage system.

When a search user wants to search over the encrypted documents, she first receives the secret key from the data owner. Then, she chooses a conjunctive keyword set $\varpi$ which contains $l$ keywords of interest and computes a trapdoor $T$ including a keyword-related token $stag$ and the encrypted query vector $Q$. Finally, the search user sends $stag$, $Q$, and an optional number $k$ to the cloud server to request the $k$ most relevant results.

Upon receiving $stag$, $Q$, and $k$ from the search user, the cloud server uses the $stag$ to access the index $\digamma$ in the blind storage and computes the relevance scores with the encrypted query vector $Q$. Then, the cloud server sends back descriptors $(Dsc)$ of the top-k documents that are most relevant to the searched keywords. The search user can use these descriptors to access the blind storage system to retrieve the encrypted documents. An access control technique, e.g., attribute-based encryption, can be implemented to manage the search user’s decryption capability.

In the EMRS, we consider the cloud server to be honest but curious, which means it executes the tasks assigned by the data owner and the search user correctly. However, it is curious about the data in its storage and the received trapdoors, and tries to obtain additional information from them. Moreover, we consider the *Knowing Background* model in the EMRS, which allows the cloud server to know more background information about the documents, such as statistical information on the keywords. Specifically, the EMRS aims to meet the following four security requirements:

- *Confidentiality of Documents and Index:* Documents and index should be encrypted before being outsourced to a cloud server. The cloud server should be prevented from prying into the outsourced documents and should not be able to deduce any associations between the documents and keywords using the index.
- *Trapdoor Privacy:* Since the search user would like to keep her searches from being exposed to the cloud server, the cloud server should be prevented from learning the exact keywords contained in the trapdoor of the search user.
- *Trapdoor Unlinkability:* The trapdoors should not be linkable, which means that trapdoors should be totally different even if they contain the same keywords. In other words, the trapdoors should be randomized rather than deterministic, so that the cloud server cannot deduce any associations between two trapdoors.
- *Concealing Access Pattern of the Search User:* The access pattern is the sequence of the searched results. In the EMRS, the access pattern should be totally concealed from the cloud server. Specifically, the cloud server cannot learn the total number of documents stored on it, nor the size of a searched document, even when the search user retrieves this document from the cloud server.

To enable efficient and privacy-preserving multi-keyword ranked search over encrypted mobile cloud data via the blind storage system, the EMRS has the following design goals:

- *Multi-Keyword Ranked Search:* To meet the requirements of practical use and provide a better user experience, the EMRS should not only support multi-keyword search over encrypted mobile cloud data, but also achieve relevance-based result ranking.
- *Search Efficiency:* Since the total number of documents may be very large in practice, the EMRS should achieve sublinear search time.
- *Confidentiality and Privacy Preservation:* To prevent the cloud server from learning any additional information about the documents and the index, and to keep search users' trapdoors secret, the EMRS should cover all the security requirements introduced above.

SECTION III

In searchable symmetric encryption (SSE) schemes, because the number of documents is large, search results should be retrieved in order of their relevance to the searched keywords. Scoring is the natural way to weight the relevance of the documents. Among many relevance scoring techniques, we adopt $TF$-$IDF$ weighting [15] in the EMRS. In $TF$-$IDF$ weighting, the term frequency $tf_{t,f}$ refers to the number of occurrences of term $t$ in a document $f$. The inverse document frequency is calculated as $idf_{t}=\log \frac {N}{df_{t}}$, where $df_{t}$ denotes the number of documents which contain term $t$ and $N$ refers to the total number of documents in the database. Then, the weighting of term $t$ in a document $f$ can be calculated as $tf_{t,f}\cdot idf_{t}$.
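The weighting above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `tf_idf_vectors`, the toy documents and the use of the natural logarithm are choices made here, since the text does not fix the base of the logarithm.

```python
import math
from collections import Counter

def tf_idf_vectors(docs, dictionary):
    """Return one TF-IDF relevance vector per document.

    docs: list of token lists; dictionary: ordered list of keywords.
    """
    n_docs = len(docs)
    # df_t: number of documents that contain term t at least once
    df = {t: sum(1 for doc in docs if t in doc) for t in dictionary}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_{t,f}: raw count of t in this document
        vec = []
        for t in dictionary:
            if df[t] == 0:
                vec.append(0.0)  # keyword absent from the whole collection
            else:
                vec.append(tf[t] * math.log(n_docs / df[t]))
        vectors.append(vec)
    return vectors

# Toy collection (hypothetical data, for illustration only)
docs = [
    ["cloud", "search", "cloud"],
    ["search", "index"],
    ["cloud", "privacy"],
]
dictionary = ["cloud", "search", "index", "privacy"]
vectors = tf_idf_vectors(docs, dictionary)
```

Each row of `vectors` plays the role of the relevance vector of one document over the keyword dictionary.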

We adopt the work of Wong et al. [12] in the EMRS. Wong et al. propose a secure $k$-nearest neighbor (kNN) scheme which can confidentially encrypt two vectors and compute the Euclidean distance between them. First, the secret key $(S,M_{1},M_{2})$ is generated. The binary vector $S$ is a splitting indicator used to split a plaintext vector into two random vectors, which obscures the values of the plaintext vector. $M_{1}$ and $M_{2}$ are used to encrypt the split vectors. The correctness and security of the secure kNN computation scheme are discussed in [12].

A blind storage system [13] is built on the cloud server to support adding, updating and deleting documents and to conceal the access pattern of the search user from the cloud server. In the blind storage system, all documents are divided into fixed-size blocks. These blocks are indexed by a sequence of random integers generated from a document-related seed. The cloud server can only see the blocks of encrypted documents being uploaded and downloaded. Thus, the blind storage system leaks little information to the cloud server. Specifically, the cloud server does not know which blocks belong to the same document, nor the total number of documents, nor the size of each document. Moreover, all the documents and the index can be stored in the blind storage system to achieve a searchable encryption scheme.

In ciphertext policy attribute-based encryption (CP-ABE) [16], ciphertexts are created with an access structure (usually an access tree) which defines the access policy. A user can decrypt the data only if the attributes embedded in his attribute keys satisfy the access policy in the ciphertext. In CP-ABE, the encrypter holds the ultimate authority of the access policy.

SECTION IV

In this section, we present the EMRS in detail. Since the encrypted documents and the index $\digamma$ are both stored in the blind storage system, we first give the general construction of the blind storage system. Moreover, since the EMRS aims to eliminate the risk of sharing the key that is used to encrypt the documents with all search users, and to solve the trapdoor unlinkability problem in Naveed's scheme [13], we modify the construction of blind storage and leverage the ciphertext policy attribute-based encryption (CP-ABE) technique in the EMRS. However, the specific construction of CP-ABE is out of the scope of this paper and we only give a brief indication here. The notations used in this paper are shown in Table 1. The EMRS consists of the following phases: System Setup, Construction of Blind Storage, Encrypted Database Setup, Trapdoor Generation, Efficient and Secure Search, and Retrieve Documents from Blind Storage.

The data owner takes a security parameter $\lambda$ and outputs two invertible matrices $M_{1}, M_{2} \in \mathbb{R}^{(d+2)\times (d+2)}$ as well as a $(d+2)$-dimensional binary vector $S$ as the secret key, where $d$ represents the size of the keyword dictionary. Then, the data owner generates a set of attribute keys $sk$ for each search user according to her role in the system. The data owner chooses a key $K_{T}$ for a symmetric encryption algorithm $Enc()$, e.g., AES. Finally, the data owner sends $(M_{1}, M_{2}, S, sk, Enc(), K_{T})$ to the search user through a secure channel.

The data owner chooses a full-domain collision-resistant hash function $H$, a full-domain pseudorandom function $\Psi$, a pseudorandom generator $\Gamma$ and a hash function $\Phi \!:\! \{0,1\}^{*} \!\!\rightarrow \! \{0,1\}^{192}$. $\Psi$ and $\Gamma$ are based on the AES block cipher [13]. Then, the data owner chooses a number $\alpha >1$ that defines the expansion parameter and a number $\kappa$ that denotes the minimum number of blocks in a communication.

The data owner generates a key $K_{\Psi }$ for the function ${\Psi }$ and sends it to the search user using a secure channel.

This phase takes as input a large collection of documents $D$, a list of $m$ documents $(d_{1}, d_{2}, d_{3}, \cdots, d_{m})$, where each document has a unique id denoted as $id_{i}$. The B.Build function outputs an array of blocks $B$, which consists of $n_{b}$ blocks of $m_{b}$ bits each. Document $d_{i}$ occupies $size_{i}$ blocks of $m_{b}$ bits each, and the header of each of these blocks contains $H(id_{i})$. In addition, the header of the first block of document $d_{i}$ indicates the size of $d_{i}$. At the beginning, we initialize all blocks in $B$ to all zeros. For each document $d_{i}$ in $D$, we construct the blind storage as follows:

*Step 1:* Compute the seed $\sigma _{i}=\Psi _{K_\Psi }(id_{i})$ as the input of the function $\Gamma$. Generate a sufficiently long bit string through the function $\Gamma$ using the seed $\sigma _{i}$ and parse it as a sequence of integers in the range $[n_{b}]$. Let $\pi [\sigma _{i},l]$ denote the first $l$ integers of this sequence. Generate a set $S_{f}=\pi [\sigma _{i}, \max (\lceil \alpha \ast {size_{i}}\rceil,\kappa )]$.

*Step 2:* Let ${S^{0}_{f}}=\pi [\sigma _{i},\kappa ]$, then check if the following conditions hold:

- There exist $size_{i}$ free blocks indexed by the integers in the set $S_{f}$.
- There exists one free block indexed by the integers in the set ${S^{0}_{f}}$.

If either of the above two does not hold, abort.

*Step 3:* Pick a subset ${S^{\prime }_{f}} \subset {S_{f}}$ that contains $size_{i}$ integers, such that the blocks indexed by the integers in ${S^{\prime }_{f}}$ are all free. We rely on the fact that the integers in the set $S_{f}$ are in a random order: we pick the first $size_{i}$ integers indexing free blocks and let these integers form the subset ${S^{\prime }_{f}}$. Mark these blocks as occupied. Then, write the document $d_{i}$ to the blocks indexed by the integers in ${S^{\prime }_{f}}$ in an increasing order.

Note that the blocks of different documents can be written to the blind storage system at the same time to conceal the associations among the blocks. The specific construction and encryption of each block are discussed next.

The main idea of the blind storage system is to store a document in a set of fixed-size blocks indexed by integers that are generated by applying the seed $\sigma _{i}$ to the pseudorandom generator $\Gamma$. To reduce the probability that the number of free blocks indexed by integers in $S_{f}$ is less than $size_{i}$, we choose a sequence of $\alpha \ast size_{i}$ integers as the set $S_{f}$. The choice of the parameter $\alpha$ reflects an inherent tension between collision probability and wasted space. With a suitable choice of parameters, the probability that the two conditions in Step 2 do not hold is negligible [13], as we analyze in Section V.
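The construction steps above can be sketched as follows. This is a toy illustration, not the scheme's implementation: HMAC-SHA256 stands in for the AES-based $\Psi$, SHA-256 in counter mode stands in for $\Gamma$, indices are obtained by reducing 32-bit chunks modulo $n_{b}$ and deduplicated, and blocks are written in sequence order (one reading of "increasing order", chosen here so that the first block always falls inside $S^{0}_{f}$). The names `prf`, `pi` and `build` are hypothetical.

```python
import hashlib
import hmac
import math

def prf(key: bytes, ident: bytes) -> bytes:
    """Stand-in for Psi: HMAC-SHA256 (the paper uses an AES-based PRF)."""
    return hmac.new(key, ident, hashlib.sha256).digest()

def pi(seed: bytes, length: int, n_b: int) -> list:
    """Stand-in for Gamma: the first `length` distinct indices in [0, n_b)."""
    if length > n_b:
        raise ValueError("cannot draw more distinct indices than blocks")
    out, seen, counter = [], set(), 0
    while len(out) < length:
        chunk = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        idx = int.from_bytes(chunk[:4], "big") % n_b  # modulo bias ignored
        if idx not in seen:
            seen.add(idx)
            out.append(idx)
        counter += 1
    return out

def build(blocks, free, key, doc_id, doc_blocks, alpha, kappa):
    """Place one document; return the chosen indices, or None on abort."""
    n_b = len(blocks)
    size = len(doc_blocks)
    seed = prf(key, doc_id)                                   # Step 1
    s_f = pi(seed, max(math.ceil(alpha * size), kappa), n_b)
    s_f0 = s_f[:kappa]              # pi[sigma, kappa] is a prefix of S_f
    free_in_s_f = [j for j in s_f if free[j]]
    # Step 2: need size free blocks in S_f and one free block in S_f^0
    if len(free_in_s_f) < size or not any(free[j] for j in s_f0):
        return None
    chosen = free_in_s_f[:size]     # Step 3: first free indices in S_f
    for j, data in zip(chosen, doc_blocks):
        blocks[j] = data
        free[j] = False             # mark the block as occupied
    return chosen

n_b = 64
blocks = [None] * n_b
free = [True] * n_b
key = b"demo-key-for-psi"
chosen = build(blocks, free, key, b"doc-1", [b"b0", b"b1", b"b2"],
               alpha=4, kappa=8)
```

In this sketch the block contents are plaintext placeholders; in the scheme each block is encrypted before it is written.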

The data owner builds the encrypted database as follows:

*Step 1:* The data owner computes the $d$-dimensional relevance vector $p=(p_{1}, p_{2}, \cdots, p_{d})$ for each document using the $TF$-$IDF$ weighting technique, where $p_{j}$ for $j\in \{1,2,\cdots ,d\}$ represents the weighting of keyword $\omega _{j}$ in document $d_{i}$. Then, the data owner extends $p$ to a $(d+2)$-dimensional vector $p^{*}$. The $(d+1)$-th entry of $p^{*}$ is set to a random number $\varepsilon$ and the $(d+2)$-th entry is set to 1. We let $\varepsilon$ follow a normal distribution $N(\mu,\sigma ^{2})$ [11]. For each document $d_{i}$, to compute the encrypted relevance vector, the data owner encrypts the associated extended relevance vector $p^{*}$ using the secret key $M_{1}$, $M_{2}$ and $S$. First, the data owner chooses a random number $r$ and splits the extended relevance vector $p^{*}$ into two $(d+2)$-dimensional vectors $p^{\prime }$ and $p^{\prime \prime }$ using the vector $S$. For the $j$-th item in $p^{*}$, set
$$\begin{cases} p^{\prime }_{j}=p^{\prime \prime }_{j}=p^{*}_{j}, & \text{if } S_{j}=1\\ p^{\prime }_{j}=\frac {1}{2}p^{*}_{j}+r,\quad p^{\prime \prime }_{j}=\frac {1}{2}p^{*}_{j}-r, & \text{otherwise} \end{cases}$$
where $S_{j}$ represents the $j$-th item of $S$. Then compute $P=\{M_{1}^{T}\cdot p^{\prime },M_{2}^{T}\cdot p^{\prime \prime }\}$ as the encrypted relevance vector.

*Step 2:* For each document $d_{i}$ in $D$, split the document into blocks of $m_{b}$ bits each. Each block has a header $H(id_{i})$ indicating that this block belongs to document $d_{i}$, and the $size_{i}$ of the document is contained in the header of the first block of $d_{i}$. Then, for each document $d_{i}$, the data owner chooses a 192-bit key $K_{i}$ for the algorithm $Enc()$. More precisely, for each block $B[j]$ of the document $d_{i}$, where $j$ represents the index number of this block, compute $K_{i} \oplus \Phi (j)$ as the key for the encryption of this block. Since each block has a unique index number, the blocks of the same document are encrypted with different keys. The document $d_{i}$ contains $size_{i}$ encrypted blocks, and the first block of the document $d_{i}$ with index number $j$ is
$$Enc_{(K_{i} \oplus \Phi (j))}(H(id_{i})||size_{i}||data)$$
The remaining blocks of $d_{i}$ are
$$Enc_{(K_{i} \oplus \Phi (j))}(H(id_{i})||data)$$
Finally, the data owner encrypts all the documents and writes them to the blind storage system using the B.Build function.

*Step 3:* To enable efficient search over the encrypted documents, the data owner builds the index $\digamma$. First, the data owner defines the access policy $\upsilon _{i}$ for each document $d_{i}$. We denote the result of attribute-based encryption using access policy $\upsilon _{i}$ as $ABE_{\upsilon _{i}}()$. The data owner initializes $\digamma$ to an empty array indexed by all keywords. Then, the index $\digamma$ can be constructed as shown in Algorithm 1.

As we can see, the index $\digamma$ maps the keyword to the encrypted relevance vectors $(P)$ and the descriptors $(Dsc)$ of the documents that contain the keyword. And each list $\digamma [\omega ]$ can be transformed to be stored in the blind storage system with $\omega$ as the document id. Specifically, for each $\digamma [\omega ]$, the data owner computes $\sigma _{\omega }=\Psi _{K_{\Psi }}(\omega )$ as the seed for the function $\Gamma$ to generate the set $S_{f}$. Here, for each block of $\digamma [\omega ]$ indexed by the integer $j$, the data owner adds an encrypted header as $Enc_{(K_{T} \oplus \Phi (j))}(H(\omega )||size_{\omega })$, where $size_{\omega }$ represents the number of blocks that belong to $\digamma [\omega ]$. Finally, the data owner writes the index $\digamma$ to the blind storage system using the B.Build function.

When using the B.Build function, it is crucial to determine how we compute the seed for generating the set $S_{f}$. We use the document id $id_{i}$ to compute the seed for the documents stored in the blind storage system, and the keyword $\omega$ to compute the seed for each $\digamma [\omega ]$. Moreover, the header of each block of a document contains the encrypted $H(id_{i})$, and the first block indicates the $size_{i}$. The blocks of the index $\digamma$ differ from those of the documents: the header of each block of the index $\digamma$ is $Enc_{(K_{T} \oplus \Phi (j))}(H(\omega )||size_{\omega })$. This small change is made for security reasons and does not affect the implementation of the blind storage. In addition, since each block is encrypted using a key derived from its index number, the headers are different even if two blocks belong to the same document or the same list $\digamma [\omega ]$.
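The per-block key derivation $K_{i} \oplus \Phi (j)$ described above can be illustrated as follows. This sketch covers key derivation and plaintext layout only; the actual encryption $Enc()$ is omitted. SHA-256 truncated to 192 bits is an assumed stand-in for $\Phi$, the 4-byte size field is an arbitrary choice, and the names `phi`, `block_key` and `block_plaintext` are hypothetical.

```python
import hashlib
import secrets

KEY_BYTES = 24  # 192 bits

def phi(j: int) -> bytes:
    """Stand-in for Phi: SHA-256 of the block index, truncated to 192 bits."""
    return hashlib.sha256(str(j).encode()).digest()[:KEY_BYTES]

def block_key(k_i: bytes, j: int) -> bytes:
    """Per-block key K_i XOR Phi(j)."""
    return bytes(a ^ b for a, b in zip(k_i, phi(j)))

def block_plaintext(h_id: bytes, data: bytes, size=None) -> bytes:
    """H(id_i) || size_i || data for the first block, H(id_i) || data otherwise."""
    header = h_id if size is None else h_id + size.to_bytes(4, "big")
    return header + data

k_i = secrets.token_bytes(KEY_BYTES)           # document key K_i
keys = [block_key(k_i, j) for j in (5, 17, 42)]  # three blocks of one document
```

Because $\Phi (j)$ differs per index, the three derived keys differ even though they share the same $K_{i}$, which is why identical headers encrypt to different ciphertexts.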

To search over the outsourced encrypted data, the search user needs to compute the trapdoor including a keyword-related token $stag$ and encrypted query vector $Q$ as follows:

*Step 1:* The search user takes a keyword conjunction $\varpi =(\omega _{1}, \omega _{2}, \cdots, \omega _{l})$ with $l$ keywords of interest in $W$. A $d$-dimensional binary query vector $q$ is generated, where the $j$-th bit of $q$ represents whether $\omega _{j}\in \varpi$ or not. Then, the search user chooses two random numbers $r$, $t$ and extends the query vector $q$ to a $(d+2)$-dimensional vector $q^{*}$ as
$$q^{*}=\{rq,r,t\}$$
Then, the search user chooses a random number $r^{\prime }$ and splits the vector $q^{*}$ into two $(d+2)$-dimensional vectors $q^{\prime }$ and $q^{\prime \prime }$. For the $j$-th item in $q^{*}$, set
$$\begin{cases} q^{\prime }_{j}=q^{\prime \prime }_{j}=q^{*}_{j}, & \text{if } S_{j}=0\\ q^{\prime }_{j}=\frac {1}{2}q^{*}_{j}+r^{\prime },\quad q^{\prime \prime }_{j}=\frac {1}{2}q^{*}_{j}-r^{\prime }, & \text{otherwise} \end{cases}$$
The search user computes $Q=\{M_{1}^{-1}\cdot q^{\prime },M_{2}^{-1}\cdot q^{\prime \prime }\}$ as the encrypted query vector.

*Step 2:* The search user chooses the estimated least frequent keyword $\omega ^{\prime }$ in the conjunction $\varpi$ and computes the seed $\sigma _{\omega ^{\prime }}=\Psi _{K_\Psi }(\omega ^{\prime })$. Then the search user generates a long bit string through the function $\Gamma$ using the seed $\sigma _{\omega ^{\prime }}$. The search user takes the sequence $\pi [\sigma _{\omega ^{\prime }},\kappa ]$ and adds $\kappa$ random dummy integers to it. The search user downloads the blocks indexed by these $2\kappa$ integers and decrypts each header using the key $K_{T} \oplus \Phi (j)$, where $j$ is the index number of the block, to find the first block of the list $\digamma [\omega ^{\prime }]$, which consists of the descriptors and the encrypted relevance vectors of the documents containing $\omega ^{\prime }$. Then the search user obtains $size_{\omega ^{\prime }}$ from the first block and computes the set $S_{\omega }=\pi [\sigma _{\omega ^{\prime }}, \alpha \ast size_{\omega ^{\prime }}]$. The search user adds $\alpha \ast size_{\omega ^{\prime }}$ random dummy integers to the set $S_{\omega }$, resulting in a set $S^{\prime }_{\omega }$ of $2 \alpha \ast size_{\omega ^{\prime }}$ integers. The extended set $S^{\prime }_{\omega }$ is denoted as $stag$. Note that the $stag$ contains dummy integers for privacy reasons.

Finally, the search user sends $Q$, $stag$ and a number $k$ to the cloud server to request the $k$ most relevant documents.
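The way dummy integers are mixed into the $stag$ can be sketched as below. The function name `make_stag` is hypothetical, and two details are assumptions of this sketch rather than statements of the scheme: dummies are drawn from indices not already in the real set (so the result has exactly twice as many integers), and the final list is shuffled.

```python
import secrets

def make_stag(real_indices, n_b):
    """Mix the real block indices with an equal number of dummy indices."""
    rng = secrets.SystemRandom()
    real = set(real_indices)
    if 2 * len(real) > n_b:
        raise ValueError("not enough block indices left for the dummies")
    candidates = [j for j in range(n_b) if j not in real]
    dummies = rng.sample(candidates, len(real))  # distinct dummy indices
    stag = list(real) + dummies
    rng.shuffle(stag)                            # hide which half is real
    return stag

# Hypothetical real index set S_omega for one keyword, with n_b = 64 blocks
real = [3, 9, 27, 40]
stag = make_stag(real, 64)
```

Calling `make_stag` twice on the same real set will almost always produce different dummy halves, which is the property the unlinkability argument relies on.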

Upon receiving $Q$, $stag$, and $k$, the cloud server parses the $stag$ to get a set of integers in the range $[n_{b}]$. Then, the cloud server accesses the index $\digamma$ in the blind storage and retrieves the blocks indexed by the integers to obtain the tuples $(ABE_{\upsilon _{i}}(id_{i}||K_{i} ||x),P)$ on these blocks. Note that these blocks consist of the blocks of $\digamma [\omega ^{\prime }]$ and some dummy blocks. For each retrieved encrypted relevance vector $P$, compute the relevance score $Score_{i}$ for the associated document $d_{i}$ with the encrypted query vector $Q$ as follows:
$$\begin{aligned} Score_{i}&=P \cdot Q \\ &=\{M_{1}^{T}\cdot p^{\prime },M_{2}^{T}\cdot p^{\prime \prime }\} \cdot \{M_{1}^{-1}\cdot q^{\prime },M_{2}^{-1}\cdot q^{\prime \prime }\} \\ &=p^{\prime } \cdot q^{\prime }+p^{\prime \prime } \cdot q^{\prime \prime } \\ &=p^{*} \cdot q^{*} \\ &=(p,\varepsilon ,1)\cdot (rq,r,t) \\ &=r(p\cdot q+\varepsilon )+t \end{aligned}$$
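The derivation above can be checked numerically with a toy sketch. It is not secure and not the authors' implementation: small random integer matrices, small integer noise and Python's non-cryptographic `random` module are assumptions made so that exact `Fraction` arithmetic can confirm the identity; all function names are hypothetical.

```python
import random
from fractions import Fraction

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(m, v):
    return [dot(row, v) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def inverse(m):
    """Gauss-Jordan inverse over exact fractions; ValueError if singular."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((k for k in range(col, n) if aug[k][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for k in range(n):
            if k != col and aug[k][col] != 0:
                f = aug[k][col]
                aug[k] = [x - f * y for x, y in zip(aug[k], aug[col])]
    return [row[n:] for row in aug]

def keygen(d, rng):
    """Toy secret key: S plus (M1, M1^-1) and (M2, M2^-1), dimension d + 2."""
    n = d + 2
    s = [rng.randint(0, 1) for _ in range(n)]
    mats = []
    while len(mats) < 2:
        m = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            mats.append((m, inverse(m)))
        except ValueError:
            pass  # retry until the random matrix is invertible
    return s, mats[0], mats[1]

def split(v, s, copy_when, noise):
    """Copy v_j where S_j == copy_when, otherwise split as v_j/2 +/- noise."""
    a = [vj if sj == copy_when else vj / 2 + noise for vj, sj in zip(v, s)]
    b = [vj if sj == copy_when else vj / 2 - noise for vj, sj in zip(v, s)]
    return a, b

def encrypt_doc(p, eps, key, rng):
    s, (m1, _), (m2, _) = key
    p_star = [Fraction(x) for x in p] + [Fraction(eps), Fraction(1)]
    p1, p2 = split(p_star, s, 1, Fraction(rng.randint(1, 99)))
    return mat_vec(transpose(m1), p1), mat_vec(transpose(m2), p2)

def encrypt_query(q, r, t, key, rng):
    s, (_, m1_inv), (_, m2_inv) = key
    q_star = [Fraction(r) * x for x in q] + [Fraction(r), Fraction(t)]
    q1, q2 = split(q_star, s, 0, Fraction(rng.randint(1, 99)))
    return mat_vec(m1_inv, q1), mat_vec(m2_inv, q2)

def score(P, Q):
    return dot(P[0], Q[0]) + dot(P[1], Q[1])

rng = random.Random(2014)
key = keygen(3, rng)
p, eps = [Fraction(2), Fraction(0), Fraction(1)], Fraction(1, 4)
q, r, t = [1, 0, 1], Fraction(5), Fraction(11)
result = score(encrypt_doc(p, eps, key, rng), encrypt_query(q, r, t, key, rng))
```

With these toy values `result` equals $r(p\cdot q+\varepsilon )+t = 5\cdot (3+\tfrac {1}{4})+11$, regardless of which matrices, splitting vector or noise values were drawn.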

Finally, after sorting the relevance scores, the cloud server sends back the descriptors $ABE_{\upsilon _{i}}(id_{i}||K_{i} || x)$ of the top-k documents that are most relevant to the searched keywords. Note that, as discussed before, attribute-based encryption as an access control technique can be implemented to manage search user’s decryption capability.

Upon receiving a set of descriptors $ABE_{\upsilon _{i}}(id_{i}||K_{i} || x)$, the search user can retrieve the documents as follows:

*Step 1:* If the search user's attributes satisfy the access policy of the document, the search user can decrypt the descriptor using her secret attribute keys to get the document id $id_{i}$ and the associated symmetric key $K_{i}$. To retrieve the document $d_{i}$, compute the seed $\sigma _{i}=\Psi _{K_\Psi }(id_{i})$ for the function $\Gamma$. Generate a sufficiently long bit string through the function $\Gamma$ using the seed $\sigma _{i}$, parse it as a sequence of integers in the range $[n_{b}]$ and choose the first $\kappa$ integers as the set $S^{0}_{f}$. Retrieve the blocks indexed by these $\kappa$ integers from the encrypted database $D$ through the blind storage system.

*Step 2:* The search user tries to decrypt these blocks using the symmetric key $K_{i} \oplus \Phi (j)$ until she finds the first block of the document $d_{i}$. If she does not find the first block, the document is not present in the system. Otherwise, the search user recovers the size of the document $size_{i}$ from the header of the first block.

*Step 3:* Then, the search user computes $l=\lceil \alpha \ast {size_{i}} \rceil$. If $l \leq \kappa$, compute $S_{f}=\pi [\sigma _{i}, \kappa ]$. Otherwise, compute $S_{f}=\pi [\sigma _{i}, l]$ and retrieve the rest of the blocks indexed by the integers in $S_{f}$ via the blind storage system. Decrypt these blocks and combine the blocks with the header $H(id_{i})$ in an increasing order to recover document $d_{i}$.

Here we have explained how the search user retrieves one document from the blind storage system. This forms the foundation of the B.Access function of the blind storage. Moreover, the search user can request more than one document at once by combining the sequences $S^{0}_{f}$ and $S_{f}$ of different documents in a random order. This combination further conceals the access pattern of the search user, since the cloud server does not even know the number of documents that the search user requests.
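The two download rounds of the retrieval steps can be sketched as index selection over the pseudorandom sequence. Here `sequence` stands for the already expanded output of $\Gamma$ for one document, and the function names `first_round` and `second_round` are hypothetical; decryption and reassembly are left out.

```python
import math

def first_round(sequence, kappa):
    """Indices S_f^0 = pi[sigma_i, kappa] fetched before the size is known."""
    return sequence[:kappa]

def second_round(sequence, size, alpha, kappa):
    """Remaining indices of S_f once size_i has been read from the first block."""
    l = math.ceil(alpha * size)
    s_f = sequence[:max(l, kappa)]
    return s_f[kappa:]  # empty when l <= kappa: nothing more to download

# Hypothetical expanded sequence for one document
sequence = list(range(100, 140))
small = second_round(sequence, size=1, alpha=4, kappa=8)
large = second_round(sequence, size=3, alpha=4, kappa=8)
```

A document with $\lceil \alpha \ast size_{i}\rceil \leq \kappa$ is therefore fully covered by the first round, which is why $\kappa$ acts as the minimum number of blocks in a communication.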

SECTION V

Under the assumption presented in Section II, we analyze the security properties of the EMRS. We give analysis of the EMRS in terms of confidentiality of documents and index, trapdoor privacy, trapdoor unlinkability and concealing access pattern of the search user.

The documents are encrypted by the traditional symmetric cryptography technique before being outsourced to the cloud server. Without a correct key, the search user and cloud server cannot decrypt the documents. As for index confidentiality, the relevance vector for each document is encrypted using the secret key $M_{1}$, $M_{2}$, and $S$. And the descriptors of the documents are encrypted using CP-ABE technique. Thus, the cloud server can only use the index $\digamma$ to retrieve the encrypted relevance vectors without knowing any additional information, such as the associations between the documents and the keywords. And only the search user with correct attribute keys can decrypt the descriptor $ABE_{\upsilon _{i}}(id_{i}||K_{i} || x )$ to get the document id and the associated symmetric key. Thus, the confidentiality of documents and index can be well protected.

When a search user generates her trapdoor including the keyword-related token $stag$ and encrypted query vector $Q$, she randomly chooses two numbers $r$ and $t$. Then, for the query vector $q$, the search user extends it as $(rq,r,t)$ and encrypts the query vector using the secret key $M_{1},M_{2}$ and $S$. Thus, the query vectors can be totally different even if they contain same keywords. And we use the secure function $\Psi$ and $\Gamma$ to help the search user compute keyword-related token $stag$ using the secret key ${K_\Psi }$. Without the secret key $M_{1},M_{2}, S$ and ${K_\Psi }$, the cloud server cannot pry into the trapdoor. And the search user can add dummy integers to the set $S_{f}$ to conceal what she is truly searching for. Thus, the keyword information in the trapdoor is totally concealed from the cloud server in the EMRS and trapdoor privacy is well protected.

Trapdoor unlinkability means that the cloud server cannot deduce associations between any two trapdoors. Even though the cloud server cannot decrypt the trapdoors, any association between two trapdoors may lead to the leakage of the search user's privacy. We consider whether two trapdoors, each including a $stag$ and an encrypted query vector $Q$, can be linked to each other or to the keywords. Moreover, we show that the EMRS achieves trapdoor unlinkability under the *Knowing Background* model.

The encrypted query vector $Q$ is defined as $\{M_{1}^{-1}\cdot q^{\prime },M_{2}^{-1}\cdot q^{\prime \prime }\}$ in the EMRS. To compute it, the search user first extends the query vector $q$ to $q^{*}$. The $(d+1)$-th and $(d+2)$-th entries of the vector $q^{*}$ are set to the random values $r$ and $t$, so there are $2^{{\eta }_{r}}\cdot 2^{{\eta }_{t}}$ possible values, where $r$ and $t$ are ${\eta }_{r}$ and ${\eta }_{t}$ bits long, respectively. Further, the search user splits the vector $q^{*}$ according to the splitting vector $S$ as described in Section IV. If $S_{j}=1$, $q^{*}_{j}$ is split into two random values which add up to $q^{*}_{j}$. Suppose that the number of 1s in $S$ is $\mu$ and each dimension of the vector $q^{\prime }$ is $\eta _{q}$ bits long. Note that ${\eta }_{r}$, ${\eta }_{t}$, $\mu$ and $\eta _{q}$ are independent of each other. Then we can compute the probability that two encrypted query vectors are the same as
$$P=\frac {1}{{2^{{\eta }_{r}}}{2^{{\eta }_{t}}}{2^{{\mu } {\eta _{q}}}}}=\frac {1}{{2^{{\eta }_{r}+{{\eta }_{t}}+{\mu } {\eta _{q}}}}}$$
Therefore, the larger these parameters are, the lower the probability is. Hence, if we choose 1024-bit $r$ and $t$, the probability that two encrypted query vectors are the same is $P<\frac {1}{2^{2048}}$, which is negligible.

As for the keyword-related token $stag$, the search user first obtains $size_{\omega }$ from the cloud server using the sequence of $2\kappa$ integers, half of which are dummy integers. Then, the search user computes the set $S_{\omega }=\pi [\sigma _{\omega }, \alpha \ast size_{\omega }]$ and adds $\alpha \ast size_{\omega }$ dummy integers to the set $S_{\omega }$ to form the $stag$. Thus, each $stag$ contains $2 \alpha \ast size_{\omega }$ integers, half of which are random dummy integers. Suppose the integers are $n_{b}$ bits long. Then the probability that two $stag$s are the same is
$$P^{\prime }=\frac {1}{2^{2 \alpha \ast size_{\omega } \ast n_{b}}}$$
Hence, if we choose 12-bit long $n_{b}$, 3-bit long extension parameter $\alpha$ and $size_{\omega }$ is supposed to be 8-bit long, the probability $P^{\prime }<\frac {1}{2^{576}}$, which is negligible.
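For reference, the exponent 576 is reproduced by substituting the stated figures directly into the exponent of $P^{\prime }$, reading them as $n_{b}=12$, $\alpha =3$ and $size_{\omega }=8$. This reading is an assumption made here, since the text states the figures as bit lengths:

$$2 \alpha \ast size_{\omega } \ast n_{b} = 2 \cdot 3 \cdot 8 \cdot 12 = 576$$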

In Cash’s scheme [10] and Naveed’s scheme [13], the search user can only compute the same $stag$ or the same set $S_{f}$ for the same keyword. Consequently, when a search user accesses the cloud server using a keyword that has been searched before, the cloud server can learn that the two search requests contain the same keyword. Under the $Knowing~Background$ model, the cloud server may learn the search frequency of the keywords and deduce some information using statistical knowledge in [10] and [13].

The access pattern refers to the sequence of the searched results [11]. In Cash’s scheme [10] and Cao’s scheme [11], the search user directly obtains the associated documents from the cloud server, which may reveal the association between the search request and the documents to the cloud server. In the EMRS, the access pattern is well concealed from the cloud server by modifying the blind storage system. Since the headers of the blocks are encrypted with the block number $j$ and each descriptor has a random padding, the blocks are different even if they belong to the same document. Thus, the cloud server can only see blocks being downloaded and uploaded. The cloud server does not even know the number of documents stored in its storage or the length of each document, since all the documents are divided into blocks stored in a random order. In addition, when a search user requests a document, she can choose more blocks than the document contains. Moreover, she can request blocks of different documents at one time in a random order to fully conceal what she is requesting.

In the implementation of the blind storage system, the choice of parameters creates a trade-off between security guarantee and performance. We define $P_{err}$ as the probability that the data owner aborts the document when there are not enough free blocks indexed by the integers in the set $S_{f}$, as discussed in Section IV. When this abort happens, some illegitimate information may be revealed to the cloud server [13]. We consider the parameters $\gamma$, $\alpha$ and $\kappa$ to measure $P_{err}$. We denote $\gamma =n_{b} / m$, where $n_{b}$ is the number of blocks in the array $B$ and $m$ is the total number of documents stored on the cloud server. $\alpha$ is the ratio that scales the number of blocks a document contains to the number of blocks in the set $S_{f}$. $\kappa$ is the minimum number of blocks in a transaction. Then, according to [13], $P_{err}$ is bounded as $$\begin{equation} P_{err}(\gamma,\alpha,\kappa ) \leq \max \limits _{n\geq \frac {\kappa }{\alpha }} \sum \limits _{i=0}^{n-1} \binom {\lceil \alpha n \rceil }{i} \left ( \frac {\gamma -1}{\gamma } \right )^{i} \left ( \frac {1}{\gamma } \right )^{\lceil \alpha n\rceil -i} \end{equation}$$
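The bound can be evaluated numerically. The Python sketch below is an illustration only: the maximum over $n$ is truncated at a hypothetical cut-off `n_max` (the formula itself ranges over all $n\geq \kappa /\alpha$), and the parameter values in the usage lines are invented for the example.

```python
import math

def p_err_bound(gamma, alpha, kappa, n_max=64):
    """Evaluate the upper bound on P_err for n in [ceil(kappa/alpha), n_max].

    Each inner sum is the probability that fewer than n of the ceil(alpha*n)
    selected blocks are free, when a block is free with probability (gamma-1)/gamma.
    """
    p_free = (gamma - 1) / gamma
    p_used = 1 / gamma
    n_min = max(1, math.ceil(kappa / alpha))
    worst = 0.0
    for n in range(n_min, n_max + 1):
        trials = math.ceil(alpha * n)
        tail = sum(
            math.comb(trials, i) * p_free**i * p_used**(trials - i)
            for i in range(n)
        )
        worst = max(worst, tail)
    return worst

# Hypothetical parameters: gamma = 4, kappa = 8, and two choices of alpha.
loose = p_err_bound(gamma=4, alpha=2, kappa=8)
tight = p_err_bound(gamma=4, alpha=4, kappa=8)
```

With these toy values, the larger $\alpha$ gives the smaller bound, at the cost of touching more blocks per document.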

As we can see, the larger these parameters are, the lower the probability $P_{err}$ is and the higher the security guarantee would be. However, the parameters also influence the performance of the blind storage system, such as the communication and computation cost. With a suitable choice of these parameters, the probability $P_{err}$ is negligible [13].

The comparison of security levels is shown in TABLE 2. We can see that the EMRS achieves the best security guarantees compared with the existing schemes [10], [11], [13].

SECTION VI

Considering the large number of documents and search users in a cloud environment, searchable encryption schemes should allow privacy-preserving multi-keyword search and return documents in an order of higher relevance to the search request. As shown in TABLE 3, we compare functionalities among the EMRS, Cash’s scheme [10], Cao’s scheme [11] and Naveed’s scheme [13].

Cash’s scheme supports multi-keyword search, but cannot return results in a specific order of the relevance score. Cao’s scheme achieves multi-keyword search and returns documents in a relevance-based order. Naveed’s scheme implements the blind storage system to protect the access pattern, but it only supports single-keyword search and returns undifferentiated results. The EMRS achieves multi-keyword search and relevance sorting while preserving high security guarantees, as discussed in Section V.

We evaluate the performance of the EMRS through simulations and compare the time cost with Cao’s [11]. We use a real dataset, National Science Foundation Research Awards Abstracts 1990–2003 [17], from which we randomly select some documents. We conduct the experiments on a computing machine with a 2.8GHz processor to evaluate the performance of the index construction and search phases. Moreover, we implement the trapdoor generation on a 1.2GHz smart phone. We show the simulation experiments of the EMRS and demonstrate that the computation overhead of index construction and trapdoor generation is almost the same as that of Cao’s [11]. Then we compare the execution time of the search phase with Cao’s [11] and show that the EMRS achieves better search efficiency.

Index construction in the EMRS consists of two phases: encrypted relevance vector computation and the efficient index $\digamma$ construction via blind storage.

As for the computation of the encrypted relevance vector, the data owner first needs to compute the relevance score for each keyword in each document using the $TF-IDF$ technique. As shown in Fig. 2, both the size of the dictionary and the number of documents influence the time for calculating all the relevance scores. Then, to compute the encrypted relevance vector $P$, the data owner needs two multiplications of a $(d+2)*(d+2)$ matrix and a ($\text{d}+2$)-dimension vector with complexity $O(d^{2})$. The time cost for computing all the encrypted relevance vectors is linear in the size of the database, since the time for building the subindex of one document is fixed. Thus, the computation complexity is $O(md^{2})$, where $m$ represents the number of documents in the database and $d$ represents the size of the keyword dictionary $W$. The computation complexity is the same as that in Cao’s [11]. The computational cost for computing the encrypted relevance vectors is shown in Fig. 3. As we can see, both the size of the dictionary and the number of documents affect the execution time.

Finally, we adopt the index $\digamma$ via the blind storage in the EMRS to improve search efficiency and conceal the access pattern of the search user. For each keyword $\omega \in W$, we need to build the list $\digamma [\omega ]$ of tuples $(ABE_{\upsilon _{i}}(id_{i}||K_{i} || x),P)$ of documents that contain the keyword and upload it using the B.Build function. So the computation complexity to build the index $\digamma$ is $O(\varrho d)$, where $\varrho$ represents the average number of tuples contained in the list $\digamma [\omega ]$ and is no more than the number of documents $m$. Since the access pattern is not considered in most schemes, we do not give a specific comparison of the implementation of the blind storage [13] in the EMRS.

In the EMRS, trapdoor generation consists of computing the $stag$ and the encrypted query vector $Q$. To compute $stag$, the search user only needs two efficient operations ($\Psi$ and $\Gamma$) to generate a sequence of random integers. Compared with the time cost of computing the encrypted query vector, which increases with the size of the keyword dictionary, the time cost for computing $stag$ is negligible. As for computing the encrypted query vector $Q$, the search user needs to compute two multiplications of a $(d+2)*(d+2)$ matrix and a ($\text{d}+2$)-dimension vector with complexity $O(d^{2})$. Thus, the computation complexity of trapdoor generation for the search user is $O(d^{2})$, which is the same as that in Cao’s scheme [11]. As shown in Fig. 4, we conduct a simulation experiment on a 1.2GHz smart phone and give the experiment results for computing the trapdoor in the EMRS.

The search operation in Cao’s scheme [11] requires computing the relevance scores for all documents in the database. For each document, the cloud server needs to compute the inner product of two ($\text{d}+2$)-dimension vectors twice. Thus, the computation complexity for the whole data collection is $O(md)$. As we can see, the search time in Cao’s scheme increases linearly with the scale of the dataset, which is impractical for large-scale datasets.
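The reason two inner products per document suffice can be illustrated with a toy Python sketch of the secure kNN idea. Only the form $Q=\{M_{1}^{-1}\cdot q^{\prime },M_{2}^{-1}\cdot q^{\prime \prime }\}$ is taken from the text; encrypting the index halves with the transposed matrices, splitting index entries where $S_{j}=1$, the small integer matrices, and the toy vectors are assumptions made for illustration, and exact fractions replace the numerics of a real implementation.

```python
from fractions import Fraction
import random

def mat_vec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inverse(M):
    """Gauss-Jordan inverse over exact fractions; raises ValueError if singular."""
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [x / pv for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]

def random_invertible(n, rng):
    """Draw small integer matrices until one is invertible."""
    while True:
        M = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            return M, inverse(M)
        except ValueError:
            continue

def split(v, S, split_on, rng):
    """Additively split entries where S_j == split_on; copy the others."""
    a, b = [], []
    for x, s in zip(v, S):
        if s == split_on:
            r = rng.randint(-100, 100)
            a.append(r)
            b.append(x - r)
        else:
            a.append(x)
            b.append(x)
    return a, b

rng = random.Random(0)
n = 5                      # d + 2 for a toy dictionary of d = 3 keywords
p = [3, 0, 2, 1, 1]        # hypothetical extended relevance vector
q = [1, 0, 1, 4, 7]        # hypothetical extended query vector
S = [rng.randint(0, 1) for _ in range(n)]
M1, M1_inv = random_invertible(n, rng)
M2, M2_inv = random_invertible(n, rng)

p1, p2 = split(p, S, 1, rng)   # assumption: index entries are split where S_j = 1
q1, q2 = split(q, S, 0, rng)   # query entries are split where S_j = 0
P = (mat_vec(transpose(M1), p1), mat_vec(transpose(M2), p2))
Q = (mat_vec(M1_inv, q1), mat_vec(M2_inv, q2))

score = dot(P[0], Q[0]) + dot(P[1], Q[1])   # the two inner products per document
```

Here `score` equals the plaintext inner product `dot(p, q)`, because $(M_{1}^{T}p^{\prime })\cdot (M_{1}^{-1}q^{\prime })=p^{\prime }\cdot q^{\prime }$ and the split shares recombine across the two halves. The server therefore performs two $(d+2)$-dimension inner products per document without seeing $p$ or $q$.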

In the EMRS, by adopting the inverted index $\digamma$ which is built in the blind storage system, we achieve a sublinear computation overhead compared with Cao’s scheme. Upon receiving $stag$, the cloud server can use $stag$ to access the blind storage and retrieve the encrypted relevance vectors from the blocks indexed by the $stag$. These blocks consist of blocks of documents containing the $stag$-related keyword and some dummy blocks. Thus, the EMRS can significantly decrease the number of documents whose relevance to the searched keywords must be evaluated. The cloud server only needs to compute the inner product of two ($\text{d}+2$)-dimension vectors for the associated documents rather than computing relevance scores for all documents as in Cao’s scheme [11]. The computation complexity for the search operation in the EMRS is $O(\alpha \varrho _{s} d)$, where $\varrho _{s}$ represents the number of documents which contain the keyword applied by the keyword-related token $stag$ and $\alpha$ is the extension parameter that scales the number of blocks in a document to the number of blocks in the set $S_{f}$. The value of $\varrho _{s}$ can be small if the search user chooses the estimated least frequent keyword, such that the computation cost for search on the cloud server is significantly reduced.
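The effect of the inverted index on the number of scored documents can be sketched in plaintext Python. The sketch deliberately omits encryption, blind storage and dummy blocks, and the document identifiers, keyword lists and integer relevance scores are invented for illustration.

```python
def search(inverted_index, relevance, query_keywords, query_vector, top_k=2):
    """Score only the documents listed under the least frequent query keyword."""
    # Pick the keyword with the shortest posting list.
    rarest = min(query_keywords, key=lambda w: len(inverted_index.get(w, [])))
    candidates = inverted_index.get(rarest, [])
    scored = [
        (sum(a * b for a, b in zip(relevance[doc], query_vector)), doc)
        for doc in candidates
    ]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]], len(candidates)

# Toy dictionary W = (cloud, search, privacy) with one relevance vector per document.
relevance = {
    "d1": [5, 2, 0],
    "d2": [1, 0, 9],
    "d3": [0, 7, 3],
    "d4": [4, 4, 0],
}
inverted_index = {
    "cloud": ["d1", "d2", "d4"],
    "search": ["d1", "d3", "d4"],
    "privacy": ["d2", "d3"],
}
ranked, num_scored = search(inverted_index, relevance, ["search", "privacy"], [0, 1, 1])
```

In this toy run only the two documents listed under the rarer keyword are scored, instead of all four, which mirrors the reduction from $m$ to $\varrho _{s}$ inner-product evaluations (before the factor $\alpha$ introduced by dummy blocks).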

As shown in Fig. 5, the computation cost of the search phase is mainly affected by the number of documents in the dataset and the size of the keyword dictionary. In our experiments, we implement the index in memory to avoid time-consuming I/O operations. Note that, although the time cost of the search operation increases linearly in both schemes, the increase rate of the EMRS is less than half of that in Cao’s scheme.

Once the system is set up, including the generation of encrypted documents and the index, the communication overhead is mainly influenced by the search phase. In this section, we compare the communication overhead among the EMRS, Cash’s scheme [10], Cao’s scheme [11] and Naveed’s scheme [13] when searching over the cloud server. Since most existing SSE schemes only consider obtaining a sequence of results rather than the related documents, the comparison here does not involve the communication of retrieving the documents.

In Cao’s scheme [11], the search user needs to compute the trapdoor and send it to the cloud server, after which she obtains the searched results. The communication overhead in Cao’s scheme is $2(d+2)\eta _{q}$, where $d$ represents the size of the keyword dictionary and each dimension of the encrypted query vector is $\eta _{q}$-bit long. According to Cash’s scheme [10], when a search user wants to query the cloud server using a conjunctive keyword set $\varpi$, she needs to compute $stag$ for the estimated least-frequent keyword and $xtoken\text{s}$ for the other keywords in the set $\varpi$. Each $xtoken$ contains $|\varpi |$ elements in $G$, where $G$ is a group of prime order $p$. Moreover, the search user needs to keep computing $xtoken\text{s}$ until the cloud server sends stop, which indicates that the total number of $xtoken\text{s}$ is linear in $\varrho$, the number of documents containing the keyword related to the $stag$. This results in considerable unnecessary communication overhead of $\varrho |\varpi | |G|$, where $|G|$ represents the size of an element in $G$. In Naveed’s scheme [13], since the index is constructed in the blind storage system, the search user may need to access the blind storage system to obtain $size_{\omega }$ and then obtain the results. This requires one or two rounds of communication of $\alpha *size_{\omega } *n_{b}$ bits, where $\alpha$ is the extension parameter, $size_{\omega }$ is the number of blocks of documents containing $\omega$, and each index number is $n_{b}$-bit long. In the EMRS, we modify the way the search user computes the sequence $S_{f}$ that indexes the blocks by adding some dummy integers to $S_{f}$ to conceal what the search user is searching for. The communication comparison is shown in TABLE 4. As we can see, even though the EMRS requires a little more communication overhead, it achieves more functionalities compared with [10], [13] as shown in TABLE 3 and better search efficiency compared with [11] as shown in Fig. 5.
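The three overhead expressions stated above can be turned into a small Python calculator for quick comparisons. The function names and sample parameter values are hypothetical, the `rounds` argument assumes that each round in Naveed’s scheme carries $\alpha *size_{\omega } *n_{b}$ bits, and the EMRS entry of TABLE 4 is not reproduced here.

```python
def cao_bits(d, eta_q):
    """Trapdoor size in Cao's scheme: 2 * (d + 2) * eta_q bits."""
    return 2 * (d + 2) * eta_q

def cash_bits(rho, num_keywords, group_element_bits):
    """xtoken traffic in Cash's scheme: rho * |varpi| * |G| bits."""
    return rho * num_keywords * group_element_bits

def naveed_bits(alpha, size_w, n_b, rounds=1):
    """Index traffic in Naveed's scheme: alpha * size_w * n_b bits per round (assumed)."""
    return rounds * alpha * size_w * n_b

# Hypothetical sample parameters.
print(cao_bits(d=1000, eta_q=32))
print(cash_bits(rho=50, num_keywords=3, group_element_bits=256))
print(naveed_bits(alpha=3, size_w=8, n_b=12, rounds=2))
```

Such a calculator makes the dependencies visible: Cao’s overhead grows with the dictionary size $d$, Cash’s with the number of matching documents $\varrho$, and Naveed’s with the extension parameter and the size of the keyword’s block list.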

Note that the communication overhead in our scheme is higher than that in Cao’s scheme. However, the higher communication overhead will not severely affect the user’s experience, because it is mainly incurred by the exchange of short signaling messages that can be transmitted in a very short time. Moreover, with the adoption of advanced wireless technology, such as 4G/5G and IEEE 802.11ac, the communication delays tend to be further reduced and become negligible. As a theoretical framework, this paper targets a prototype system and exposes our proposal to the public. As such, the proposal can be modified for real-world implementation based on the specific deployment scenario, e.g., whether communication bandwidth is expensive and precious or not.

The size of the returned results in the EMRS is mainly affected by the choice of the security parameters $\alpha$ and $\kappa$. The larger these two numbers are, the higher the security guarantee the scheme provides, as we discussed in Section V. The size of the returned results for each document can be $\alpha *size_{\omega }$ blocks, which contain the blocks of the searched document and dummy blocks. Moreover, the search user can request many documents at one time and thus can avoid requesting dummy blocks. The EMRS provides tunable parameters that let search users balance their different requirements on communication and computation cost, as well as privacy.

SECTION VII

Searchable encryption is a promising technique that provides the search service over the encrypted cloud data. It can mainly be classified into two types: Searchable Public-key Encryption (SPE) and Searchable Symmetric Encryption (SSE).

Boneh et al. [18] first propose the concept of SPE, which supports single-keyword search over the encrypted cloud data. The work is later extended in [19] to support conjunctive, subset, and range search queries on encrypted data. Zhang et al. [20] propose an efficient public key searchable encryption scheme with conjunctive-subset search. However, the above proposals require that the search results match all the keywords at the same time, and cannot return results in a specific order. Further, Liu et al. [21] propose a ranked search scheme which adopts a mask matrix to achieve cost-effectiveness. Yu et al. [15] propose a multi-keyword retrieval scheme that can return the top-k relevant documents by leveraging fully homomorphic encryption. The schemes in [22] and [23] adopt the attribute-based encryption technique to achieve search authority in SPE.

Although SPE can achieve the above rich search functionalities, it is not efficient since it involves many asymmetric cryptography operations. This motivates the research on SSE mechanisms.

The first SSE scheme is introduced by Song et al. [24], which builds the searchable encrypted index in a symmetric way but only supports single-keyword search. Curtmola et al. further improve the security definitions of SSE in [25]. Their work forms the basis of many subsequent works, such as [10], [13], and [26], by introducing the fundamental approach of using a keyword-related index, which enables the quick search of documents that contain a given keyword. To meet the requirements of practical uses, conjunctive multi-keyword search is necessary, which has been studied in [11] and [15]. Moreover, to give the search user a better search experience, some proposals [27], [28] enable ranked results instead of returning undifferentiated results by introducing the relevance score to searchable encryption. To further improve the user experience, fuzzy keyword search over encrypted data has also been developed in [7] and [29].

Cao et al. [11] propose a privacy-preserving multi-keyword search scheme that supports ranked results by adopting the secure $k$-nearest neighbors (kNN) technique in searchable encryption. The proposal can achieve rich functionalities such as multi-keyword search and ranked results, but requires the computation of relevance scores for all documents contained in the database. This operation incurs a heavy computation overhead on the cloud server and is therefore not suitable for large-scale datasets. Cash et al. [10] adopt the inverted index $TSet$, which maps each keyword to the documents containing it, to achieve efficient multi-keyword search for large-scale datasets. The work is later extended in [26] with an implementation on real-world datasets. However, ranked results are not supported in [26]. Naveed et al. [13] construct a blind storage system to achieve searchable encryption and conceal the access pattern of the search user. However, only single-keyword search is supported in [13].

SECTION VIII

In this paper, we have proposed a multi-keyword ranked search scheme to enable accurate, efficient and secure search over encrypted mobile cloud data. Security analysis has demonstrated that the proposed scheme can effectively achieve confidentiality of documents and index, trapdoor privacy, trapdoor unlinkability, and concealment of the access pattern of the search user. Extensive performance evaluations have shown that the proposed scheme can achieve better efficiency in terms of functionality and computation overhead compared with existing ones. For future work, we will investigate the authentication and access control issues in searchable encryption.

This work was supported in part by the International Science and Technology Cooperation and Exchange Program of Sichuan Province, China, under Grant 2014HH0029, the China Post-Doctoral Science Foundation under Grant 2014M552336, and the National Natural Science Foundation of China under Grant 61472065, Grant 61350110238, Grant U1233108, Grant U1333127, Grant 61272525, and 61472065.

Corresponding Author: H. Li
