uFETCH: A Unified Searchable Encryption Scheme and Its Saas-Native to Make DBMS Privacy-Preserving

,


I. INTRODUCTION
As General Data Protection Regulation (GDPR) [1] and similar laws are acting in more countries, people are much aware of their rights on privacy. Because of these laws, companies doing business with private data have no options but to protect them with full strength. With raised awareness, it would make no sense to private data owners if the companies, working as a cloud service provider (CSP) in the era of cloud, further moves private data to yet another CSP for repository. Despite such a practice so common in 3-tier cloud structure [2] that enables agility and more, as long as GDPR is considered, out-sourcing repository does make CSP harder to claim compliance since the out-sourced repository is at best semi-trusted [3]. Therefore, a challenge arises: how can CSP preserve the privacy of its users while being able to out-source repository?
The associate editor coordinating the review of this manuscript and approving it for publication was Gianmaria Silvello .
Intuitively, encryption such as AES seems a solution for CSP to address the challenge. For example, CSP can encrypt private data before storing them upon cloud repository. If CSP only keeps the key(s), virtually neither cloud repository nor CSP can see the private data except the epochs of data processing. This way of managing data not only retains the 3-tier cloud structure but also makes CSP free from escorting private data all the time. Technique as such also eases implementing ''the right to be forgotten'' in GDPR because simply deleting a key destroys all associated data at once.
However, as CSP has to process data as fast as possible, mere encryption would make CSP a nightmare. That is, all encrypted data have to be retrieved back from cloud repository before decrypting for the ''wanted''. Though, with traditional encryption, only retrieving all can guarantee a perfect recall necessary for business, it does suffer the worst precision by metrics of Information Retrieval (IR) [4].
To get decent precision, solutions for commissioned search among encrypted items have been proposed. Among them, Searchable Encryption (SE) and other encrypted-search techniques stand out to enable efficient retrieval. However, if CSP takes a deeper look at these solutions, obstacles can be found that explain why these ''near-practical'' solutions are still rarely deployed.

A. EXISTING TECHNIQUES AND RESTRICTIONS
Foreseeing privacy issues at the dawn of cloud era, SE has long been proposed to encrypt data flowing to cloud while preserving IR ability for those who have the key. However, the name of SE is often misleading that seems to bring forth special encryption whose ciphertext is searchable. In fact, except the proposal of Song et al. [5], almost all SE schemes actually build subsidiary SE index(es) for word(s) that represents the data. With SE index(es) associated with encrypted data, retrieving encrypted data becomes a matching problem. That is, with the right key, SE Trapdoor can be built for a word to match through SE index(es). Since this indexbased approach [6], [7] emerged, a variety of SE schemes have been proposed for single-keyword search, ranging from offering simple equality queries [5]- [7] to annexing queries with better expressiveness such as fuzzy [8], [9] and wildcard [10]- [15]. For multiple-keyword search, conjunctive schemes [10], [16]- [18] have been proposed; in case that keywords have to be consecutive, phrase [19]- [22] and even multiphrase [23] can be useful. But one thing in common with them is: when talking about SE, schemes are dedicated to matching for word(s).
Besides words, privacy issue also attracts lots of attention on privacy-preserving search for encrypted numbers. This line of research is often seen in database community. Such works fill the vacancy in the paradigm of database management system (DBMS) where numbers usually dominate. But unlike SE, this community are more liberal to engage skills beyond cryptography. For example, in 2002 Hacigümüş et al. [24] proposed a bucketing technique with SQL translations to map range queries onto buckets recalls. Since then, a number of varying proposals appeared with many aiming at database-as-a-service (DaaS). While many works proposed to modify DBMS in order to inject techniques such as bivariate [25], PBtree [26], trusted hardware [27] and multi-party computation [28], some preferring not to. Those techniques that requires no server modification, e.g. [24], [29] and [30], are regarded as SaaS-native. But SaaSnative or not, all these techniques are dedicated to looking for number(s).
Of course, neither words nor numbers alone are sufficient to do business as CSP inevitably has to face databases mixed with textual and numerical data. But, with the development of encrypted-search techniques as summarized, it is a nature presumption that heterogeneous techniques across data types have to be intertwined to deal with general databases. Such a presumption is somewhat affirmed after CryptDB [31] integrated techniques of order-preserving encryption (OPE) [30], homomorphic encryption (HE) [32] and SE [5] to make DBMS privacy-preserving. Remarkably, by picking only SaaS-native techniques, CryptDB is gaining popularity as it fits well with DaaS that forbids any DBMS retrofit. In fact, such an integration even inspired works such as MONOMI [33], L-EncDB [22] and Seabed [34]; all putting efforts on intertwining heterogeneous techniques that we propose to refrain from.

B. CONTRIBUTIONS
This work proposes a solution for CSP to address the challenge of preserving user privacy with out-sourced repository. We introduce a unified SE scheme called uFETCH (unified Frequency-Eliminated Trapdoor-Character Hopping) that enables efficient encrypted-search across data types. With the merit of SaaS-native to make off-the-shelf DBMS privacy-preserving, uFETCH brings the following contributions to advocate cloud security and privacy.  [7]. However, we notice that such notions are actually meant for 2-tier client-server structure, instead of the 3-tier structure recurring in cloud scenarios. In effect, such ''over-spec'' security often leads to inefficient SE schemes whenever better query expressiveness is needed. For example, in Table 1, SE schemes supporting wildcard-based pattern search are often bounded by linear complexity unfit for increasing data in cloud. uFETCH can break that bound with adequate security in 3-tier cloud structure for the sake of commercial-grade performance.
In the following, Section II first reviews some SE primitives devised for SaaS-native, setting a background for the VOLUME 8, 2020 TABLE 1. Existing SE schemes supporting wildcard-based pattern search are bounded by linear complexity, but our works break that bound by leveraging existing DBMS capability in 3-tier cloud structure (n and m are the number of indexed documents and words respectively).
proposed scheme. The proposed uFETCH and its security discussions for 3-tier cloud structure are given in Section III. The simple security agent enabled by uFETCH is demonstrated in Section IV, followed by experimental results in Section V. Finally, a conclusion is given in Section VI.

II. SE PRIMITIVES FOR SaaS-NATIVE
In this section, SE primitives [35] devised for textual-only encrypted search are reviewed as a background. Note that, the SE primitives along are not secure but serves to set up operations and notations to be adopted in the proposed SE scheme. The operations and notations illustrate how the problem of encrypted search (using wildcards in particular) can be transformed into a problem of subsequence matching for SaaS-native.

A. INDEX CONSTRUCTION
Inspired by distorting mirrors in physical world, the SE primitives reflect and project an object (a given word) onto an obfuscating image (an SE index) mathematically; and in case that you cover (wildcardize) a part of the object (the word), the part that is not covered still gets reflected and projected onto another image (an SE trapdoor) which, though still obfuscating, will be in part the same as the uncovered image (the SE index) and thus can be used as a clue for matching. But unlike distorting mirrors, ''garbling'' is introduced, resulting in three steps of reflecting, projecting and garbling for building an SE index, as shown in Fig. 1.
More precisely, given a word p = p 1 p 2 . . . p N , in the step of reflecting, characters of p are ''reflected'' by multiplying with a matrix R of M -by-N entries defined as with each r m∈{1,2,...,M } defined as a shuffled version of an N -tuple vector z x 0 x 1 . . . x α−1 0 0 . . . 0 , wherein each non-zero entry is a monomial called a reflecting pointer and α defines the number of them in each row of R. We denote the In the step of projecting, each of the polynomials is projected onto a character. A keyed hashing such as HMAC can be conducted over the coefficients of each polynomial with a secret keyk $ ← − {0, 1} n , and the first character of the output hash is selected as the projected character. Denoting such a projecting as , we obtain an image Finally, for the step of garbling, the image i is partitioned into λ ''rooms'' specified by a sequence b of increasing integers, i.e.
Assuming b is randomly-generated according to (4) that we denote as b Aligned with the partitioned i, a Garbler G is defined as where each π l∈{1,2,...,λ} is a randomly-generated permutation matrix [36] Note that, the resulted idx is SaaS-native because it can be stored as a string, i.e. the most common type in DBMS.

B. TRAPDOOR CONSTRUCTION
Before showing how SE trapdoors can be built for wildcard queries, we first define what a wildcard query is: is a wildcard query if it is made by replacing zero or more characters of a certain word p = p 1 p 2 . . . p N with wildcard , that represents any single character. And we denotẽ q p ifq matches p. More precisely, By this definition and the notation, wildcard query is seen as a generalization of equality query. Given a wildcard queryq = q 1q2 . . .q N , to conduct encrypted-search through SE indexes, an SE trapdoor can be built with the same {R,k, b}. As illustrated in Fig. 2, the procedure of building SE trapdoor for the wildcard query also comprises three steps: reflecting, projecting and sifting. In reflecting, If there is any inq, it would get reflected among the coefficients ofṽ 1 (x),ṽ 2 (x), . . . ,ṽ M (x). For each polynomial containing at least one , we force the result of projecting to be a . In general, any operation dealing with is forced to output , resulting in ''wildcard propagation''. Therefore, one might obtain a -involved Aligned with the partitioned t, a Sift S is defined as wherein each σ l∈{1,2,...,λ} is a randomly-generated selection vector [36] of dimension b l that has exactly one entry of 1 and 0's elsewhere. By σ l∈{1,2,...,λ} , one character per the room t l∈{1,2,...,λ} is selected (i.e. mux). Thereby, an SE trapdoor is built as Denoting trap idx if trap is a subsequence of idx, it can be observed thatq as proved in [37]. With this property, cloud repository can conduct a textual search simply by checking whether a given trap is a subsequence to any of stored idx's. That is, with noq nor p ever exposed to, cloud repository can be commissioned to search while remaining ignorant. The resulted trapdoor trap is SaaS-native because it only requires subsequence matching, a well-supported feature in most DBMS.

III. PROPOSED UNIFIED SE SCHEME -uFETCH
The property of (14) indicates the reviewed SE primitives as SaaS-native. However, the SE primitives can only serve words of fixed-length of N , that is too restricted for most scenarios. Besides, when feeding a word of same characters, the index construction will lead to an idx of same character. For example, a word ''000. . .0'' would lead to an idx like ''xxx...x'' that leaks the word pattern.
In this section, the SE primitives are extended and formalized into the proposed uFETCH of four algorithms {KeyGen, BuildIndex, Trapdoor, Match}. uFETCH frees the word-length restriction and leaks no word pattern. Being a unified SE scheme, uFETCH also builds indexes for numbers with range queries enabled.
Our initial idea is to ''normalize'' all words to N -character long, and if N is configured large enough, the scheme can then cover most words. Interestingly, it happens that such normalization can even apply to numbers as long as one can find ways that not only transform any number d ∈ [min, max] into a word p = p 1 p 2 . . . p N , but also any range r ⊂ [min, max] intoq = q 1q2 . . .q N such that if d ∈ r theñ q p. Then, leveraging (14), a perfect recall for numerical encrypted search is assured. We found one feasible way to achieve it is by N projections that project r and d ontoq and p respectively.
A. uFETCH SCHEME KEY

Algorithm 1 KeyGen
Input : a unary 1 n , min and max Output: a uFETCH scheme key k An instance of uFETCH is defined by a scheme key k ← {R,k, b, n} with public configuration of N , M , α, λ, L and (min, max). Besides the ingredients of the reviewed SE primitives, n is introduced as a vector containing N secret numbers n 1 , n 2 , . . . , n N within (min, max) that configure N projections in algorithms BuildIndex and Trapdoor and are used exclusively for numbers.
Note that, these secret numbers may or may not be randomly generated, as they can be specified per a known distribution to conceal it, as detailed in Section III-E4.

B. uFETCH INDEX
The algorithm BuildIndex embodies our idea to ''normalize'' a datum. That is, when a textual word is given, steps are taken to make it a word of N -character long. But if it is a number, N projections are instead conducted to make it a N -character word too. Thereby, regardless of the data type, the same procedure of projecting, reflecting and garbling can follow up to build uFETCH indexes.

Algorithm 2 BuildIndex
Input : a scheme key k = {R,k, b, n}, a datum d Output: a uFETCH index idx // S1. project onto a N-char binary word // S3. pattern conceal and make N-char long BuildIndex comprises stages of S1 to S4. S1 checks if a given datum d is a number. If asserted, it transforms d into a N -character word. That is, N binary comparisons are conducted with their binary results concatenated ( ) into a word w. Thereby, after S1, w is obtained even if it is a number. In S2, by a pseudo random-random function (PRF) f (k, ·), a string of PRF characters is generated with the length of w as the seed. The first part of the string is used to pad short words to make them at least L-character long. The remaining PRF characters are used in S3 to wipe out the pattern of w by XOR. S3 repeats XORing until a N -character p is concatenated. Finally, the background primitives are engaged in S4 to build the corresponding SE index. Note that, the PRF f (k, ·) can also be implemented by keyed-hashing such as HMAC, and we index the characters of a PRF string by [a . . . b], stating a-th to b-th character is used.
It can be observed that S1 is so simple that, when a given datum d is a large (small) number, S1 could result in a word w comprising many 1's (0's) as d is larger (smaller) than most of n 1 , n 2 , . . . , n N . In an extreme case, w could be a string of ''111...1'' or ''000..0'' that we cautioned. This, however, is fine because in S3 such patterns will be wiped out after XORing with the PRF sequence h.

C. uFETCH TRAPDOOR
To build a trapdoor that can identify uFETCH indexes under the same scheme key k, Trapdoor first checks whether a given query q is a word or a range. When a word is given, similar steps are taken to normalize it a word of N -character long, except any wildcard will ''propagate'' through normalization.

Algorithm 3 Trapdoor
Input : a scheme key k = {R,k, b, n}, a query q Output: a uFETCH trapdoor trap // S1. project onto a N-char ternary word 1 if q is a range then

) character(s) from trap
Aligned with the pseudo codes of BuildIndex, when a numerical range is given, q has to be first projected onto a N -character word so as to comply with S2. Since a range is not a number but stands for a set of numbers usually bounded by two numbers, say (q 1 , q 2 ), the comparison in the line 4 requires special treatment. Specifically, the condition > is true and symbolized as 1 if and only if both q 1 and q 2 are larger than n i . In case that both q 1 and q 2 are NOT larger than n i , the result is false and symbolized as 0. Otherwise, is designated. For example, because (10,20]>15 is neither true nor false, is designated for that comparison. Thereby, after S1, a range is made into a N -character ternary word of characters {0, 1, }. If a wildcard query is given, S1 is skipped and S2 takes place to pad the query to be at least L-character long if it is not. But padded or not, S3 follows up to make all words to be N -character long while using XORing with PRF sequence h to conceal original word pattern. Note that, the protocol of wildcard propagation will make each wildcard propagated through XOR. Thereby, if q contains , the wordq input to S4 will also contain propagated ('s).
In S4, after the steps of reflecting, projecting and sifting (in lines 16∼ 18), each is removed from trap. If trap is longer than the threshold λ/2 , the remaining characters are randomly trimmed out. The trimming tries to make the resulted uFETCH trapdoor to be of the same length of λ/2 if possible. It helps conceal search pattern as to be discussed.

D. uFETCH ENCRYPTED SEARCH
What makes uFETCH extraordinary and attractive is that it transforms the problem of encrypted search into the simple problem of subsequence matching, with trapdoors telling nothing about the data type being searched.

Algorithm 4 Match
Input : an SE trapdoor trap, an SE index idx Output: True or False 1 output True if trap idx; otherwise, output False As shown in the algorithm Match, to check if a query identifies any encrypted item, cloud repository can simply check if a given SE trapdoor trap is a subsequence to any stored SE index idx, as denoted by the operator . With a typeless trapdoor, cloud repository is made ignorant of the data type when it is commissioned to search, not even the number of wildcards in use if any. Thanks to the implication of (14), under the same scheme key k = {R,k, b, n}, uFETCH brings a perfect recall rate of 1 regardless of the configuration of N , M , α, λ, (min, max) and L. Though, for a good precision, configuration does matter as exemplified in Section V.

E. SECURITY DISCUSSIONS
Typically, CSPs provide their services running in the middle of a 3-tier structure as shown in Fig. 3, wherein private data fed from the tier-1 are processed in tier-2 but stored in the tier-3. Such a 3-tier structure is widely-adopted because the tier-3 cloud repositories can free CSP from tedious-yet-serious storage management.
In the 3-tier cloud structure, repository providers see each CSP as a tenant regardless how many users it has. In fact, assuming no collusion between the CSPs and repository providers, repository providers cannot know the number nor the identity of users in tier-1 if CSP enforces encryption, just as what cloud service A does in Fig. 3. Please note that, to utilize any security proxy/agent such as CryptDB [31] or the one we are about to propose in tier-2, it is assumed that CSP is allowed to see private data in plaintext. Though CSP is entrusted by tier-1 users to see data in order to process them as fast as possible for the service it offers, CSP has to make tier-3 cloud repository ignorant of all the (processed) data it uploads This assumption is generic to SE schemes and existing SE-integrated tools such as CryptDB [31], MONOMI [33], L-EncDB [22] and Seabed [34] as long as they are used in tier-2 to help CSP protect tier-1 user privacy against adversarial DBMS in tier-3.
With the 3-tier cloud structure, security notions such as IND-CKA [6], [7] seem ''over-spec'' to assure each SE index indistinguishable from one another. Though notions as such are tough, they are actually meant for 2-tier clientserver structure, wherein user identity is often known. As user identity can be exploited to apply domain-knowledge or side-channel to compromise the meaning of certain SE index, IND-CKA prevents any compromised SE index from breaching another.
However, it is not the case with the 3-tier cloud structure, assuming all data are anonymised by encryption and put under tenants' umbrella. Thereby, it is arguable that SE indexes have no need to be indistinguishable from one another, as long as they properly conceal the data.

1) CFA-IMMUNITY IN 3-TIER CLOUD STRUCTURE
Assuming encryption is enforced in the 3-tier cloud structure, with no user identity to exploit, adversaries inside cloud repository can be modelled as accessible to some open dictionaries. However, this alone could lead to the notorious Character Frequency Attack (CFA).
uFETCH is immune to CFA. By reflecting, projecting and engaging Garbler G and Sift S, it makes character frequency eliminated (FE). Namely, character frequencies are ''merged'' after reflecting-and-projecting, with the merged frequencies further ''blended'' by Garbler G and Sift S before producing index and trapdoor respectively. From uFETCH indexes and trapdoors, adversaries are only left with severely distorted frequencies that are useless to launch CFA.
Furthermore, uFETCH can conceal search pattern [7] to prevent user's query behaviour from being analyzed. That is, given the same query repeatedly, Trapdoor is able to build different trapdoors with all reaching the same goal. Trapdoor also tries to make all trapdoors the same length to conceal the wildcard usage in a textual query or the wideness of a range query. By randomly punching out character(s), Trapdoor builds dynamic trapdoors that seem to have trapdoor characters hopping (TCH) around. Combining the techniques of FE and TCH, uFETCH protects its indexes and trapdoors from CFA.

2) MAKING INDEX ENTROPY CLOSE TO WHITE NOISE
Whenever necessary, uFETCH can actually be better than CFA-immunity. As a case study, using a configuration of N = 40, M = 200, α = 16, λ = 40, L = 7, (min, max) = (0, 100) and HMAC-SHA256 as the PRF f with base64 characters output, the entropy of built indexes can be made quite close to that of White Noise. Though this configuration pulls down precision (comparing to the instance in Section V), it can even be considered for the 2-tier client-server structure.
Using KJV Bible as an example, we first check the ensemble character distribution along with the entropy of all unique words in KJV Bible. Then, after BuildIndex, the ensemble character distribution and the entropy of the corresponding uFETCH indexes are examined. As the perfect entropy of base64 is 6, uFETCH can make characters distributed very close to White Noise, with an near-perfect entropy of 5.9998 as shown in Fig. 4.

3) MAKING INDEX INDISTINGUISHABLE
To check how indistinguishable uFETCH indexes can be, under the same configuration, uFETCH indexes are built for alphabetically-similar words and put along with the uFETCH indexes for numbers within (0,100). We check if similar words or nearby numbers will have their uFETCH indexes related to each other. Why similar words? Intuitively, with the same uFETCH scheme key k, it can be anticipated that similar words will get reflected-and-projected similarly. Thus, uFETCH indexes built for similar words might have same characters spread as the common subsequence between uFETCH indexes. There is the same concern for nearby numbers. Therefore, we are curious about whether uFETCH can even eliminate such a clue, if necessary.
As shown Fig. 5, longest common subsequence (LCS) length between FETCH indexes is measured to see if it distinguishes similar words or nearby numbers. Because of the noise resulted from ''redundant'' characters prepared in rooms for TCH, the LCS length between similar words and not-similar words can be quite indistinguishable. However, nearby numbers would have indexes with longer LCS. Though this clue could be used to ''group'' nearby numbers, it tells no relative magnitude as OPE [30] leaks.

4) FLATTENING NUMERICAL INDEXES
uFETCH can be seen as a variant to the bucketing technique proposed by Hacigümüş et al. [24] because numbers are virtually ''bucketed'' by uFETCH indexes, with each uFETCH index designated as an identifier. Therefore, the distribution of uFETCH indexes reveals the numerical distribution.
To conceal the distribution, as proposed by Hore et al. [38], it is good to adjust bucketing so that each bucket has roughly the same quantity of items. uFETCH offers an option to do so. By generating the secret numbers n 1 , n 2 , . . . , n N according to the known distribution to be concealed, the index distribution can be flattened. For example, given a dataset of Gaussian distribution, secret numbers with Gaussian distribution can be generated and used in the scheme key k accordingly.

IV. SECURITY AGENT FOR OFF-THE-SHELF DBMS
uFETCH can bring up simpler designs that make off-the-shelf DBMS privacy-preserving. To demonstrate the simplicity, a security agent is built that relies only on the uFETCH for encrypted-search across data types, instead of integrating heterogeneous techniques.
As shown in Fig. 6, the security agent resides with CSP in tier-2 to let it move private data to tier-3 DBMS with privacy preserved. Whenever data are fed from users in tier-1, CSP stores the (processed) data to the tier-3 DBMS using SQL statements as usual but only via the security agent. CSP can of course query and update data by SQL as well. With the keys specified and kept by CSP, the security agent ''translates'' all SQL statements into SQL-aware encrypted statements so that all data are concealed before landing on tier-3. The security agent processes SQL statements with three overhead levels: L1, L2 and L3.

A. L1: SIMPLE TRANSLATION
As a database has at least one To derive the statement to tier-3 DBMS, simple translation is conducted over all fields specified by CREATE TABLE (except field(s) of no privacy concern such as auto-serialized ID). Thus, for each pair of {fieldname, type} • type is changed to VARCHAR • an extra pair of {fieldname_idx, VARCHAR} is added Then, with a table created as specified in tier-3 DBMS, CSP can populate it by INSERT statements. As INSERT specifies pairs of {fieldname, datum}, simple translation is conducted for each pair such that • datum encrypted by encryption such as AES • an extra pair of {fieldname_idx, BuildIndex(k, datum)} is added, with k a uFETCH scheme Note that, uFETCH builds an idx for a number regardless it is of INT, FLOAT or types of other resolution, and a unified field, e.g. VARCHAR, suffices to accommodate it as a string.

B. L2: SECOND-STAGE FILTERING
After a table is populated with encrypted data, CSP can selectively retrieve them back by SELECT statements via the security agent. However, because of possible false-positives, such a selective retrieval requires a second-stage filtering that slightly pushes the overhead to L2. VOLUME 8, 2020 The translated SELECT asks tier-3 DBMS to conduct subsequence matching on the index field, instead of the field of encrypted data, and is sent as L2:1 in Fig. 6. After a result set is back (L2:2), the security agent decrypts it and put the plaintext one into the local DBMS for second-stage filtering by the original SELECT (L2: [3][4]. Then the result set with a perfect precision is returned to CSP (L2:5) with each type restored according to the local table (e.g. club_cnf).
Note that, the security agent recognizes and combines two conditions into one if they constitute a range query. For example, SELECT * from club WHERE age > 16 and age < 28 will be translated into SELECT age from club WHERE age_idx LIKE Trapdoor(k, (16,28)), with SQL wildcard % put in front of, in rear of and in between trapdoor characters for subsequence matching. However, for a single-ended comparison, min or max will be used to complete a range. For example, Trapdoor(k, (16, max)) is used to build a trapdoor for age > 16.

C. L3: SELECTIVE UPDATING
Besides SELECT statements, UPDATE and DELETE can also come with WHERE followed by conditions. In such cases, similar translation can also be conducted over the conditions. However, to avoid corrupting data because of false-positives, the security agent first sends SELECT instead.
That is, given a UPDATE or DELETE statement, a SELECT statement with WHERE appended with translated conditions (pruned from UPDATE or DELETE) is sent as L3:1 in Fig. 6. After the tier-3 DBMS returns a result set L3:2, the security agent decrypts it and put the plaintext one into the local DBMS for second-stage filtering (L3:3-4) by the SELECT except this time all conditions are in plaintext. With all falsepositives filtered out, the result set gives the security agent a result set with precise identifiers id 1 , id 2 , . . . (e.g. of the field ID) to update or delete. Thereby, as shown by L3:5, an UPDATE or DELETE statement is sent to tier-3 DBMS with WHERE followed by a list of ID = id 1 OR ID = id 2 OR . . . to update affected records precisely, with affected fields ''simple translated'' as L1 does.

V. EXPERIMENTAL RESULT
A computer with an Intel i7-7700 CPU running at 3.6GHz with 16GB RAM is used in tier-2 to run both a security agent and a CSP client. The CSP client creates the example table club in tier-3 and populates it with 2 10 to 2 19 records. Each record has a name randomly-imported from a name dataset [39] along with age, weight uniform-distributed in (0,100).
We set up tier-3 DBMS, i.e. a MySQL server, within the same computer to eliminate network inference while our security agent can be replaced by CryptDB [31] for comparison. To leverage MySQL internal index for better search speed, an index is created for each field of club. The security agent is equipped with a uFETCH instance configured with N = 40, M = 200, α = 7, λ = 40 and L = 16 with (min, max) set to (0, 100). If not explicitly stated, each of measured points are an average of 1,000 randomly-generated samples.
Note that, there are works also putting efforts on heterogeneous integration such as MONOMI [33], L-EncDB [22] and  Seabed [34]. However, only CryptDB is open to be evaluated and put along with this work. As shown in Fig. 7, after CSP incrementally INSERT 2 10 to 2 19 records, we observe the average insert time of the security agent very close to that of no encryption, i.e. executing SQL statements in plaintext.
However, given the same dataset, CryptDB requires about at least 150ms to make each INSERT done in average. Its slower performance might be due to the heterogeneous integration of SE, OPE and HE. Besides, in our experiments, CryptDB actually failed to complete all operations because of out-of-memory. Specifically, CryptDB crashes when dealing with 2 18 inserted records. Therefore, points for 2 18 and 2 19 records are extrapolated from the measured points for 2 16 and 2 17 for CryptDB in Fig. 7 and Fig. 9.

B. L2, L3 OVERHEAD
Because of the second-stage filtering, we categorize the SQL statements of SELECT with a WHERE clause as having L2 overhead. However, there are two cases to be observed, namely, SELECT by searching among encrypted texts, and SELECT by searching among encrypted numbers.
In case of searching among encrypted texts, uFETCH enables the security agent to retain the handy SQL operator LIKE for wildcard queries. To measure the overhead of such L2 operations, CSP feeds the security agent with the statements of SELECT * FROM club WHERE name LIKE q, with q randomly-picked from inserted names but with zero to four characters replaced by wildcards . (Note that, it is actually the _ used in SQL. is used here for symbol consistency.) For each of the statements, the security agent conducts SQL translation as depicted in Section IV-B and we observe that the steps L2:1∼5 would bring about 100time slowdown, i.e. 1.6 second, comparing to no encryption as shown in Fig. 8 when facing 2 19 records. As CryptDB does not support wildcard query, the best-reported wildcard SE speed, i.e. GPSE [13], is put in for comparison, even though it is not SaaS-native and thus cannot be used in off-the-shelf DBMS. However, given the same 2 19 records, GPSE needs extra 38 seconds to just complete search, i.e. a time that not yet includes record retrieval and decryption.
In case of searching among encrypted numbers, uFETCH enables the security agent to support range query also by the operator LIKE as explained in subsection IV-B. To measure the overhead, CSP produces statements of SELECT * FROM club WHERE age > r1 AND age < r2, with r1 smaller than r2 and both randomly-generated within (0, 100). As shown in Fig. 9, in average, the SELECT with range conducted by the security agent is about 65-time slower than no encryption when facing 2 19 records. However, it is nevertheless about 14-time faster than CryptDB given the same amount of records. Note that, each measured point is an average of 5,000 SELECT statements with randomlygenerated ranges of 10%, 20%, 30%, 40% and 50% wide to the maximum (0, 100), wherein each wideness contributes one-fifth of the samples.
We do not plot the L3 overhead because SQL statements of L3 dynamically depends on how many records are actually updated or deleted. However, since such an overhead can be seen as L2 SELECT plus L1 update or delete, i.e. UPDATE or DELETE with precise identifiers, we consider Fig. 7, 8 and  9 are sufficient for case-by-case L3 estimation.

C. TEXTUAL IR PERFORMANCE
Though externally, CSP always receives from the security agent a result set with perfect IR [4] as expected, it is important to know how the internal IR performs because it implies additional costs of communication, memory and latency for the sake of privacy-preserving. Since uFETCH guarantees a perfect recall, we only have to measure the precision after L2:2.
For searching among encrypted texts, this IR experiment is targeting the field name and is conducted along with the L2 overhead experiment. The underlying dataset [39] provides names with average length of 6.19 characters, i.e. about 2-character shorter than English words. Though uFETCH gives better precision with longer words, the name dataset can be seen as a corner case. However, because we will measure the precision of wildcard queries with up to four wildcards in use, names shorter than five characters are not used as the q to be made into wildcard queries.
As shown in Fig. 10, given 2 10 to 2 19 randomly-picked names, the precision is checked with respect to the number of wildcards in use. Wildcard queries with less than four wildcards are very close to perfect, i.e. 1. However, when the number of wildcards is set to four, the precision drops substantially. It is due to the wildcard propagation mentioned in Section II-B and limits how the security agent can support SQL wildcard % that represents arbitrary number of characters. For example, as % can be implemented by ORing wildcard queries with zero to multiple , the security agent can only support up to four wildcards under this uFETCH instance if a decent precision is required.

D. NUMERICAL IR PERFORMANCE
As uFETCH enables numerical search by means of wildcard queries, its numerical IR precision is related to it. Intuitively, the precision of a range query would drop if the range is too wide as it leads to more wildcards in use.
As shown in Fig. 11, the best precision of queried ranges is about 0.7 with the uFETCH instance. The precision drops when queried range is getting wider. When the queried range reaches 50% of the (max, min), the precision reduces to 0.5. This is actual the lower bound for wide range queries due to too many wildcards in use. In fact, in the case of 50%, all numbers are retrieved back and the precision of 0.5 is the worst one can expect for uniform-distributed numbers.
One might come across an anti-intuition that the precision of 10%-wide range queries is worse than that of 20%. It is actually a phenomenon due to narrow range queries. Their   FIGURE 11. Precision of SQL SELECT with queried ranges that are 10% to 50% wide to the configured (min, max).
precision cannot be properly measured because it highly depends on the configured N and the volume of numbers being indexed. That is, since uFETCH virtually partitions numbers into N + 1 buckets with N secret numbers, nearby numbers tend to be bucketed together. When the trapdoor of a range query matches a bucket, all numbers in the bucket will be retrieved. However, as the query can be made very narrow comparing to the ''nearby'' distance, the precision can be made very poor. Fortunately, this poor precision can be improved by an option depicted below.

E. OPTION FOR NUMERICAL IMPROVEMENT
For clarity, in this work we evaluated the performance of encrypted numerical search by only picking fields of similar ranges, i.e. age and weight. However, for not-similarrange fields such as height and income, the security agent has to set a different (min, max) for each field, leading to multiple uFETCH instances in use. Nevertheless, it is not hard to keep just one uFETCH instance. That is, one can first represent any number by its floating-point representation with mantissa and exponent separately normalized to (min, max) and both indexed by an uFETCH instance. Though this will require two index fields (instead of one) and a bit more complicated SQL translation addressing two fields at the same time, it is worthy because it also improves the precision of the mentioned narrow range queries.

VI. CONCLUSION
This work proposes a unified SE scheme named uFETCH for both textual and numerical data. Using unified index structure and search algorithm, uFETCH enables encryptedsearch even if SE indexes across types are mingled. uFETCH offers efficient selective retrieval by transforming the problem of encrypted-search into a simple problem of subsequence matching for SaaS-native, regardless the encountered dataset contains texts, numbers or both. uFETCH is built for efficiency with a security dedicated to the widely-deployed 3-tier cloud structure.
To show how uFETCH can bring up simplicity, a security agent is demonstrated that translates SQL into three levels of SQL-aware encrypted statements, making existing DBMS privacy-preserving. Our experimental results affirm low overhead and decent IR precision. In fact, comparing with CryptDB, i.e. a popular SQL security proxy, uFETCH brings not only a simpler design, but also an even lower overhead. With adequate search speed and the adequate security, uFETCH is a practical means for CSP to ease compliance with privacy regulations.