A Real-Time Query Log Protection Method for Web Search Engines

Web search engines (e


I. INTRODUCTION
PEOPLE use Web Search Engines (WSEs) for research, shopping, and entertainment [1]. Due to the large number of Websites (over 1.7 billion in 2020, according to Internet Live Stats and NetCraft [2]), it would be inconceivable to conduct such activities manually, without the help of a WSE. The usability of WSEs is, moreover, constantly improving. By simply querying the WSE with a few keywords, one may obtain several URLs with the desired contents. However, WSEs are not limited to returning a list of URLs. When a search is conducted, a query (unstructured data) is processed and stored by the WSE. Together with the query, the WSE will store a timestamp, the URL selected by the user, and any other information collected about the user during the search. All this additional metadata, together with the query, is denoted hereinafter as a query log. Streams of query logs are processed and analyzed by the WSEs, in order to build and improve users' profiles. This is expected to improve the service offered to users, as follows:
• Personalization. Query terms can have multiple meanings. Identifying the sense required by the user represents a challenge. Previous queries submitted by a user can be used to contextualize and disambiguate terms in the future [3], [4]. This way, the WSE can prioritize relevant results (e.g., URLs) for the user and show them in the initial positions of the search results.
• Usability. The frequencies and selected results of the most submitted queries are used by WSEs to improve their ranking algorithms [5]. This can also be used to suggest alternative queries [6]. Such suggestions can show how to correct mistakes when typing, add specificity to the initial query, or provide similar queries with more results.
Search data can also be exploited for other purposes because it reveals powerful insights about customer intent-to-purchase and other factors [7]. This further exploitation can be conducted by the WSE itself or by a third party, for the following purposes:
• Marketing. The results of an advertising campaign can be studied and improved by means of the query logs. For example, users can be characterized from their query logs (gender, age, income, education, etc.), and it can afterwards be verified whether the advertisements have had an impact on the intended audience (interests and behavior) [8], [9]. Besides this, it is possible to extract market tendencies [10].
• Research. Research may be centered on the study and testing of new Information Retrieval (IR) algorithms [11], or on learning about users' information needs and query formulation approaches [10]. It can also revolve around the use of language in queries [12], among other research topics [13]-[17].
The use of query logs can lead to problems related to users' privacy. Each query log can contain a user identifier, a text about what the user is looking for, the time when the search was conducted, and the URLs selected by the user. Any party with access to the query logs can obtain information about a user's behavior, habits, interests and more sensitive information, such as religion or sexual orientation. Moreover, some query keywords may contain identifiers and quasi-identifiers [18], which may allow linking queries with real people. This is especially feasible given current tendencies such as vanity search and egosurfing [19], in which people look for their own names over the Internet.
Query logs can be efficiently protected before being released to third parties. However, faulty or weak protection can lead to serious anonymity issues. The combination of modified data can disclose enough information to re-identify users [9], [20]. There is one well-known case, the AOL case, in which around thirty-six million records related to query logs from AOL users were publicly released by AOL. Although the records were previously anonymized, it was later shown that it was still possible to identify some of the AOL users via traditional log correlation techniques [21]. As a result, sensitive information about AOL users was exposed publicly, without their express consent. The case ended up causing significant damage to AOL users' privacy and to AOL's reputation, as well as several class action suits and complaints against AOL [22]-[24].
In this paper, we address the aforementioned problems. We present an anonymization technique to protect query logs at the server side. We assume WSEs seek to monetize query logs by making them available to third parties, while respecting privacy regulations. A valid approach is to anonymize the logs prior to releasing them to the third parties. Just concealing the user identifiers, or replacing them by random information, is not enough [25]. A provable anonymization method based on, e.g., Statistical Disclosure Control (SDC) techniques [18], must be conducted to guarantee bounded disclosure risks [26]. Traditional approaches can solve this situation by conducting a k-anonymity process at the server side, before releasing the query logs. The release of data will satisfy the k-anonymity privacy property whenever user data contained within the query logs cannot be distinguished from that of at least k − 1 other users, whose data also appear in the release [27].
An important issue of traditional k-anonymity approaches is the difficulty of using unstructured streams of data while satisfying the aforementioned privacy properties. This poses an additional problem to WSEs requiring, moreover, real-time processing. We address this issue. Our solution relies on the use of probabilistic k-anonymity to bound the disclosure risk of personally identifiable user attributes. Our solution can handle unstructured data, allowing real-time processing of query streams. It provides a probabilistic method to blend streams of queries with high similarity to those requiring protection, but coming from different users. More precisely, it ensures that individuals are not identified with a probability exceeding $\frac{1}{k}$, where k is the total number of users sharing similar interests with the one meant to be protected (who is also counted in k). With our solution, a WSE can keep the raw query logs and release the anonymized versions to third party organizations. The WSE can also decide to erase the raw query logs and keep only the anonymized versions. Hence, with low utility loss, the WSE reduces the risk of information disclosure in case of intrusions.
Paper Organization - Section II presents our proposal. Section III provides architectural components and requirements. Section IV provides experimentation results validating our approach. Section V surveys related work. Section VI concludes the paper.

II. OUR PROPOSAL
We present in this section our anonymization proposal. Table 1 introduces the notation used throughout this section. Next, we provide a formal definition of the data we aim to anonymize, how the data is structured, a formal analysis of the privacy properties of the proposal, and the algorithmic version of our anonymization process.

A. DATA STRUCTURES
We assume a stream of query logs, formed by m registers, where $r_m$ corresponds to the last received query log:

$$R = \{r_0, \ldots, r_m\} \quad (1)$$

TABLE 1. Notation. $\tau^{\ell,k}$: tree $\tau$ with a depth $\ell$ and width k.

Each register is of the form:

$$r_j = \{u_i, q_j, c_g\}$$

where $u_i$ is a unique identifier that represents the user who sent the query $q_j$ to the WSE. Each query $q_j$ is composed of a set of unstructured terms, which are provided beforehand to a categorizer (cf. Section III) to obtain the classification of the query, denoted as $c_g$. This classification $c_g$ is represented as the path from a general category $\gamma^1_s$ to a more specific category $\gamma^h_{s^*}$, with the form:

$$c_g = \{\gamma^1_s, \ldots, \gamma^h_{s^*}\}$$

The path is created according to a hierarchical ontology structure by means of a tree structure $\tau$, which is formed by a set of edges $e_f \in E$ and vertices $v^h_x \in V$, where h is the depth and x the width. Each vertex $v^h_x$ of $\tau$ represents a category $\gamma^h_x$, and is related to other categories through the edges. The vertices or categories are more generic the closer they are to the roots $\{v^1_1, \ldots, v^1_x\}$, and more specific the closer they are to the leaves. Thus, every query is classified by assigning it to one of the vertices of the tree. As mentioned, the classification is the path between the root and the vertex, and it is composed of all the $\gamma$ categories of the nodes on the path.
The maximum depth of the hierarchy $\tau$ is $\ell_{max}$, defined as the distance or minimum path between the root and its farthest leaf. The number of terms or depth for each classification may be $\ell_{max}$ or lower, but we will use limited versions at depths up to $\ell$, where $\ell$ goes from 1 to $\ell_{max}$.
Each vertex $v^h_x$ contains a set of users $U^h_x$, and a set of queries $Q^h_x$. The size of $U^h_x$ will be k, but the size of $Q^h_x$ may be larger. This is because U is defined using arity, but Q is defined without the need of using arity. Therefore, we call $\tau^{\ell,k}$ the tree $\tau$ with a depth $\ell$ and a value of $|U| = k$.
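The per-vertex state described above can be illustrated with a minimal Python sketch. It assumes that each record is a (user, query, path) tuple and that a vertex is addressed by the classification path truncated to the chosen depth. The names `Vertex`, `vertex_key` and `build_tree`, as well as the sample records, are hypothetical and only serve the illustration.

```python
from collections import Counter


class Vertex:
    """State of one category vertex: users with arity (U) and pending queries (Q)."""

    def __init__(self):
        self.users = Counter()   # U: user id -> arity (times the user was seen here)
        self.queries = []        # Q: every query assigned to this category

    def add(self, user, query):
        self.users[user] += 1
        self.queries.append(query)


def vertex_key(path, depth):
    """Truncate a root-to-leaf classification path to its first `depth` categories."""
    return tuple(path[:depth])


def build_tree(records, depth):
    """Group (user, query, path) records by their truncated classification path."""
    tree = {}
    for user, query, path in records:
        tree.setdefault(vertex_key(path, depth), Vertex()).add(user, query)
    return tree


# Hypothetical records: (user id, query, classification path).
records = [
    ("u1", "cheap flights", ["Travel", "Air"]),
    ("u2", "hotel rome", ["Travel", "Lodging"]),
    ("u1", "train tickets", ["Travel", "Rail"]),
]
tree = build_tree(records, depth=1)
vertex = tree[("Travel",)]
print(len(vertex.users), len(vertex.queries))  # 2 distinct users, 3 queries
```

Because a repeated user only increases its arity, the number of distinct users in a vertex can stay below the number of stored queries, which is the size difference between U and Q noted above.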

B. RESTRICTIONS
To properly explain why $U^h_x$ and $Q^h_x$ may have different sizes, we introduce two additional restrictions that we impose on our proposal (cf. Restrictions 1 and 2).
Restriction 1. A given query associated to an anonymized log must not be assigned to the same user that issued the query in the unanonymized log.
Restriction 2. When creating an anonymized query log, the user must be selected randomly among at least k different user values.
Restriction 1 ensures that outputs do not contain unanonymized pairs of user and query. Restriction 2 imposes probabilistic k-anonymity, setting at least k distinct values for users in each category when randomly creating an anonymized log.

C. ANONYMIZATION PROCESS
We define our anonymization process as the method that generates the probabilistic k-anonymous stream of logs $R'$:

$$R' = \{r'_0, \ldots, r'_m\}$$

We assume that each record $r_j = \{u_i, q_j, c_g\}$ in R is assigned to the corresponding $v^h_x$ using its categorization $c_g$. The record $r_j$ is then separated in two parts: $u_i$, which is assigned to $U^h_x$, and $q_j$, which is assigned to $Q^h_x$. Records in $R'$ are obtained by applying a random match between one element of $U^h_x$ and one element of $Q^h_x$, where, by Restriction 1, the matched user is not the one who issued the matched query. The Id function is assumed to be a correct identification function, which given $r_j$ responds with the original $u_i$. The function Re is a re-identification function used over the records in $R'$, which given an $r'_j$ responds with a candidate user. The goal of probabilistic k-anonymity is to limit the probability of performing the right re-identification to at most $\frac{1}{k}$ for all $u_i \in R$ and for all the values of $Re(r'_j)$:

$$P(Re(r'_j) = Id(r_j)) \le \frac{1}{k} \quad (10)$$

The stream of logs $R'$ is said to satisfy probabilistic k-anonymity if, by knowing $R'$ and the anonymization process, the probability to link any record $r'_j \in R'$ and its corresponding record $r_j \in R$ is, at most, $\frac{1}{k}$.
We show next that our proposal satisfies the property defined in Eq. (10). For each vertex $v^h_x$ of $\tau$, the random selection of an element (Restriction 2) guarantees that all outcomes are equally likely to be selected. Therefore, we can state the maximum probability of re-identification of an $r'_j$ over $\tau$ using:

$$P(Re(r'_j) = Id(r_j)) = \frac{1}{|U^h_x|}$$

As the $U^h_x$ sets are defined using arity, we know that:

$$|U^h_x| \ge k$$

Someone could argue that Restriction 1 leads to a value of k − 1. However, since Restriction 2 establishes this value to k (Restriction 2 also assures that $|U^h_x| \ge k$), the upper bound of our proposal for $P(Re(r'_j) = Id(r_j))$ is lower than or equal to $\frac{1}{k}$, hence satisfying probabilistic k-anonymity. A more formal analysis of this result is provided next.

D. PRIVACY ANALYSIS
Given k (anonymity parameter) in $\mathbb{Z}^+$, a set of users U equal to $\{u_1, \ldots, u_n\}$ (such that $n \ge k$), and a set of query logs Q equal to $(u_{i_j}, q_j)_{j=1}^{j}$ up to the processing iteration j, where $q_k \ne q_l$ $\forall k, l \in [j]$ $(k \ne l)$ and $u_{i_j} \in U$. We also assume that users repeat. We assume that, given a query in $R'$, the whole $R'$ and k, an arbitrary PPT (Probabilistic Polynomial-Time) adversary A has at most a $\frac{1}{k}$ chance of guessing the user the given query was attached to in R. Now, with the notation above, let $j_0 \in [j]$ and define an experiment $\mathrm{Exp}^{Re}(k, R)$.
Theorem 1. Anon (cf. Eq. (13)) is probabilistic k-anonymous if, for every user set, for every query log R and every index $j_0 \in [j]$, any PPT adversary A has a bounded advantage up to $\frac{1}{k}$.
Proof. Let $R = (u_{i_j}, q_j)_{j=1}^{j}$ and let j be the iteration at which the first log entry is released by the anonymizer after (u, q) has been read. Let $U^R_j = (u_{i_1}, \ldots, u_{i_J})$ be the users present in R at iteration j and $U_j = (u_{i_1}, \ldots, u_{i_k})$ be the user set used internally in the anonymizer at iteration j (i.e., we know $u \in U_j \subseteq U^R_j$ and $U_j$ has at least k different users).
If $U_j$ and $Q_j$ are the users and queries stored by the anonymizer after reading query q, where $U_j$ has at least k different users, permuting users from the queries of $Q_j$ to R (all in $U_j$) has no effect on the anonymizer output, where $U_j$ contains the users that can appear in step j, hence $u \in U$. If $U_j$ is fixed and $u \in U_j$, we can consider an R where the query q is paired with each of the users u of $U_j$, and one of the queries q whence the entries of u from $U_j$ are now paired with $U'$.
If we have read $j_u$ times the user u, $\forall i: j_i \ge 1$, we obtain that the ratio of $R^*$s, being $R^* = Re(R')$ and $U_j = U'$, which contain the original pair (u, q) is at most $\frac{1}{k}$, hence satisfying Theorem 1.

E. ALGORITHMIC VERSION OF OUR PROPOSAL
An algorithmic version of our anonymization process is presented in Algorithm 1. Algorithm 2 presents its de-anonymization counterpart, assumed to be implemented by a PPT adversary. Algorithm 1 receives three main inputs: the desired k and $\ell$ values, and R as a stream of hierarchically categorized query logs. Even if all the sets are initialized empty, our proposed algorithm guarantees that $U^h_x$ is of size k every time a new anonymized log is generated from that category. It also tries to keep the size of $Q^h_x$ as close as possible to the k value. As it always chooses between k different users and at least k different queries, probabilistic k-anonymity is guaranteed.
The size of $Q^h_x$ may be bigger than k in the following situation: each time a new log enters a category and the log's user was already present in that category, the user's arity is increased by one in $U^h_x$ and the query is added to $Q^h_x$, so $|Q^h_x|$ is increased by one. If Restriction 2 is not met, there is no anonymized log release (i.e., the size of $Q^h_x$ can be bigger than k).
If Restriction 2 is met, and some user's arity is greater than one, then Algorithm 1 releases an additional log to reduce the size of Q and the user arity, also enforcing Restriction 1. This extra step is only done once per log. Therefore, at most two logs are generated each time a new record enters the category, until all users' arities are equal to one.
System performance remains stable whenever variations of the set size are conducted proportionally [28]. Hence, we modify the size of each set in incremental unitary steps. This allows the most efficient memory usage. In addition to the k parameter, the depth of the categories' tree must be specified using the $\ell$ parameter. Both k and $\ell$ remain fixed to the specified value throughout the entire execution. Table 2 depicts an example using k = 2 and $\ell$ = 1. These values have been chosen to facilitate the understanding of the example, but they are lower than the values desirable in a real application of the algorithm (cf. Section IV). The example starts with an empty system, receiving a stream R of query logs classified in two distinct categories. Figure 1 depicts the R used, and the contents of $\tau$ and $R'$ at the end of the aforementioned example. Figure 2 depicts the de-anonymization counterpart, leading to faulty re-identification.
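The behavior described in this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1: the names `Bucket`, `StreamAnonymizer`, `process` and `_release` are hypothetical, and the policy of consuming one user instance and one pending query per released log is our reading of the description above. The sketch releases a log only when a category holds at least k distinct users (Restriction 2), never pairs a query with its issuer (Restriction 1), and emits at most two logs per incoming record.

```python
import random
from collections import Counter, defaultdict


class Bucket:
    """State of one category: users with arity (U) and pending queries (Q)."""

    def __init__(self):
        self.users = Counter()   # user id -> arity
        self.queries = []        # pending (query, issuer) pairs


class StreamAnonymizer:
    def __init__(self, k, depth, seed=None):
        if k < 2:
            raise ValueError("k must be at least 2")
        self.k = k
        self.depth = depth
        self.rng = random.Random(seed)
        self.buckets = defaultdict(Bucket)

    def _release(self, key, bucket):
        # Draw a pending query at random, then a user other than its issuer
        # (Restriction 1); consume one instance of the drawn user (assumption).
        index = self.rng.randrange(len(bucket.queries))
        query, issuer = bucket.queries.pop(index)
        candidates = [u for u in bucket.users if u != issuer]
        user = self.rng.choice(candidates)
        bucket.users[user] -= 1
        if bucket.users[user] == 0:
            del bucket.users[user]
        return (user, query, key)

    def process(self, user, query, path):
        """Consume one categorized log; return the anonymized logs released (0 to 2)."""
        key = tuple(path[: self.depth])
        bucket = self.buckets[key]
        bucket.users[user] += 1
        bucket.queries.append((query, user))
        released = []
        if len(bucket.users) >= self.k:            # Restriction 2 is met
            released.append(self._release(key, bucket))
            # Extra release, done once, while some user still has arity > 1.
            if len(bucket.users) >= self.k and max(bucket.users.values()) > 1:
                released.append(self._release(key, bucket))
        return released


anonymizer = StreamAnonymizer(k=2, depth=1, seed=1)
for user, query in [("u1", "q1"), ("u2", "q2"), ("u1", "q3")]:
    print(anonymizer.process(user, query, ["Sports"]))
```

Keeping the issuer next to each pending query is a bookkeeping choice made here so that Restriction 1 can be checked at release time; a deployment that erases raw logs would have to discard that field once the query leaves the bucket.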

III. PRACTICAL IMPLEMENTATION
We present in this section a practical implementation of our proposal. We describe the architecture and requirements, before moving to the presentation of the experimental results.

A. INITIAL ARCHITECTURE
We aim at implementing an anonymization method that can be used by Web Search Engines (WSEs) to anonymize query logs in a streaming environment, at the server side (cf. Figure 3). The input data of the anonymization algorithm is a continuous stream of categorized query logs. The outputs are a continuous stream of anonymized logs and a database of user profiles. To meet the goals of our proposal, we must ensure that those outputs meet the set of requirements detailed below.

FIGURE 3. Our proposal defines a WSE query log anonymization method in a streaming environment. The input of the algorithm is a stream of query logs. The outputs are a stream of anonymized logs and a database of user profiles. (Diagram labels: WSE, Anonymizer, Query logs, Anonymized logs, Profiles.)

B. FUNCTIONAL REQUIREMENTS
In addition to the restrictions and properties already defined in Section II, we report next some functional requirements for the practical implementation of our proposal.
Scalability - It refers to the capability of a system to handle a growing amount of work, or its potential to be enlarged in order to accommodate that growth [29]. In our system, the objective is to achieve load scalability, defined as the ability to accommodate heavier or lighter loads. Scalability methods can be classified in two main categories [30]:
• Horizontal Scalability is related to the ability of a system to add more working nodes, such as a new computer. Hundreds of small computers may be configured in a cluster to obtain aggregate computing power. This approach demands an architecture that allows efficient management and maintenance of multiple nodes.
• Vertical Scalability is related to the ability of adding resources to a single node in a system, typically involving the addition of CPUs or memory. Such an approach could be interesting in a virtualized environment, as it could provide more resources according to the virtual node needs. This approach demands an architecture that allows efficient management of used processes and memory.
The two models have their own particular benefits and limitations. If necessary, our proposal should use all possible assets. In such a case, the design should be integrated into existing systems on a WSE architecture. Ideally, our system can take advantage of underused resources.
Resource Consumption - In order to take advantage of underused resources on existing architectures, and minimize system deployment costs, we want minimal resource consumption. If the designed system is able to use a limited amount of resources, all necessary data could be kept and processed in memory, obtaining better execution times.
Speed - We need a fast processing speed to be able to process all received logs in real time. Otherwise, some kind of memory buffer will be necessary to keep incoming logs until they are processed. That buffer will increase our resource consumption. An additional requirement, in terms of processing speed, is therefore to use only small buffers at specific overload times. Nowadays, a WSE receives millions of user queries each hour. Therefore, our system should handle that load, to be able to integrate it into an existing WSE architecture.
Efficiency - Beyond reduced resource consumption and fast processing time, we aim at assuring the algorithmic efficiency of the proposal. We consider that this requirement will be achieved if the algorithmic time complexity of our proposal is linear in the size of the inputs.
Transparency - We want a straightforward integration of our approach into an existing architecture. Having a transparent system implies that no component of the existing WSE should be modified. For this purpose, our module is expected to be encapsulated within the WSE. It should also be able to interact with the existing interfaces of the WSE, without forcing any changes. It should also be able to generate anonymized logs, while complying with all the previous requirements.
Modularity - We want to have low coupling and high cohesion to achieve a fully transparent component. Modularity has the added benefit that modifications to the proposal can be implemented with minimal effort, and that tests can be carried out with different alternatives for the treatment of the data.

C. EXPANDED ARCHITECTURE
The initial proposal depicted in Figure 3 is expanded with two additional parts: Attacker and Researcher. This allows a proper empirical evaluation, in addition to the analysis conducted in Section II. The proposed system is designed using a micro-service architecture pattern as presented in Figure 4. For the current study, all the defined systems are used. In a real WSE environment, only the parts marked as WSE should be deployed.
Within the expanded architecture, we find two main components: anonymizer and profiler. The anonymizer is a component implementing Algorithm 1. The profiler creates protected user profiles, using the categories of each log assigned to that user by the anonymizer. Those categories are added to a user profile database in real time. Each profile in the database contains a frequency distribution of those categories queried by the user. They can be seen as user interests that could be released to third parties, for profit.
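A profile of the kind described above could be kept as a per-user counter of categories. The sketch below is only illustrative: the `Profiler` class and its methods are hypothetical names, and an in-memory dictionary stands in for the profile database.

```python
from collections import Counter, defaultdict


class Profiler:
    """In-memory stand-in for the profile database (the store itself is an assumption)."""

    def __init__(self):
        self.profiles = defaultdict(Counter)   # user id -> Counter of category paths

    def update(self, user, category_path):
        """Record one anonymized log: count its category in the user's profile."""
        self.profiles[user][tuple(category_path)] += 1

    def distribution(self, user):
        """Relative frequency of each category in the user's profile."""
        counts = self.profiles.get(user, Counter())
        total = sum(counts.values())
        return {category: n / total for category, n in counts.items()}


profiler = Profiler()
for user, path in [("u7", ["Sports"]), ("u7", ["Sports"]), ("u7", ["Travel"])]:
    profiler.update(user, path)
print(profiler.distribution("u7"))
```

Since the profiler only sees the user assigned by the anonymizer, each profile mixes categories issued by different users that share the same category vertex.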

1) Actors
Three actors are defined in our current test architecture:
• WSE - has the responsibility of query log anonymization and publication.
• Attacker - has access to the anonymized stream of logs, and tries to recover the original relationship between the log and the user who made the original query.
• Researcher - can check all the data, but cannot modify anything, to test the validity of the proposal.

2) Phases
Our study is divided into three main phases:
• Anonymization and profile creation - this phase represents the normal execution of the system on the WSE environment. It takes the query logs generated and anonymizes them, also generating a database of user profiles.
• De-anonymization - it simulates attacks, trying to link as many of the anonymized logs as possible with the user that originally made the query.
• Analysis - it conducts anonymization, de-anonymization and performance benchmarking, taking into account original and generated data, time and resource usage.

3) Interactions
In a real WSE environment, the WSE will anonymize the query logs and release the anonymized ones to its clients as the main interaction. In our tests, the attacker acts as a normal client from the WSE point of view. The attacker processes the anonymized output of the WSE and generates another log stream, trying to reconstruct the original query logs. Only during the tests, secondary interactions occur between those actors and the researcher, who receives original, anonymized and de-anonymized query logs. Further information is presented in the sequel.

IV. EXPERIMENTAL RESULTS
We report in this section a practical implementation of our approach, together with experimental tests and results that validate it in terms of privacy, data utility and other functional requirements. Experiments were conducted using a Dell notebook running Ubuntu Linux 16.04 LTS, with a 1.8 GHz Intel Core i7-4500U CPU and 8 GB of RAM. The system hard disk was a Seagate ST1000LM014, whose performance profile is skewed strongly towards small file I/O, with a below average overall performance. All algorithms were implemented and executed in Python 2.7.12.

A. IMPLEMENTATION
Algorithm 1, described in Section II, has been implemented using the Python language. Input query logs used to test our system were downloaded from the publicly available AOL log repository, in the form of plain text files. In order to respect our transparency functional requirement, we chose to make this file the main input of our system. However, other methods to feed logs to the system, such as a real-time input via sockets, could be used. The same applies to the system output: we decided to store it in plain text files, preserving the original logs' format. Additionally, a NoSQL database was used to store the generated user profiles.
Because AOL's released files do not have any classification, they need to be categorized by an external categorizer before any of the proposed algorithms can be applied. We used a slightly modified version of the deterministic classifier proposed in previous work [28]. The use of a deterministic classifier guarantees that the same query will always provide the same unique category. In case a query triggers multiple categories, the classifier will always take the most probable one. Other families of classifiers can be adapted and integrated in our approach thanks to the proposed micro-service architecture. Classifier modifications allow us to obtain a query categorization organized in several hierarchical levels. Some queries contain letters or symbols without any meaning, and some contain no text at all. Our classifier was not able to resolve those logs, and they were left out of the data used to test the proposal. However, some changes made to the natural language processing algorithms of the classifier made it possible to categorize 98% of the original logs, an improvement over the 85% categorized in [28]. As it is out of the scope of the current proposal, the implementation of the classifier will not be evaluated. Priority will be given to allowing interoperability between our proposal and different classifiers. Usually, the classification process needs more specific data, related to the WSE environment or the desired output categories. Thus, we leave each WSE free to choose the strategy that best suits its needs.
We also validate the possible record linkage of the anonymized stream, implementing three different record linkage algorithms, and evaluate for each algorithm which requirements are fulfilled. In addition, some other changes that have been made to the initial architecture described in Section III are discussed below.

B. EVALUATION METHODOLOGY
The algorithmic solution proposed in Section II, and all the architectural components, requirements and implementation details defined in Sections III and IV-A, have been used to conduct an experimental evaluation and comparison to previous work in [28]. In particular, one version of the anonymizer, and three versions of the de-anonymizer are implemented and evaluated in terms of utility, privacy and functional requirements.

1) Experimental Datasets
For our experiments, we use plain datasets (i.e., text files), containing query logs released by AOL [31]. The released AOL data contains up to thirty-six million query logs. Such query logs correspond to a three-month period of real web search activity conducted by AOL users, and released by AOL for research purposes. Figure 5 provides a brief sample of the logs used. The Classifier (cf. Section IV-A) adds to each log record an additional column with a hierarchical classification in the form of a list with n elements. In our case, n was between one and 13, and each element of the list represents a subcategory of the previous element. This classification is generated independently of the anonymizer. Therefore, this list contains all the subcategories which the Classifier is able to generate for a given query, regardless of the $\ell$ used by the anonymization process.

2) Conducted Tests
The proposed system can be configured using two parameters, k and $\ell$, where k is the desired number of different users in each category and $\ell$ is the maximum depth of categories and subcategories used for each record. Several tests were conducted to determine their effects.
Anonymizer - To generate anonymized data, the proposed anonymizer was executed on all available AOL logs multiple times, to cover different k and $\ell$ values. k has taken values between 3 and 200 to be able to compare the obtained results with previous ones [28]. To do this, Algorithm 1 needs to be tested at least using $\ell$ = 1. We decided to test all available $\ell$ values, which with our classification correspond to values between one and 13, but we found that from 11 onwards, differences were not significant: few logs have more than 11 categories of depth. Our privacy, functional and utility requirements are checked for every combination of k and $\ell$.
Profiler - Specific tests were conducted with the profiler, to determine the amount of data utility that could be lost with the creation of anonymized profiles with respect to unanonymized profiles. For those tests, we used k values between three and 90 and $\ell$ values between one and 13.
De-anonymizer - A de-anonymization has been attempted against all anonymized data. All anonymized data was tested against three different record-linkage algorithms:
• Record-linkage 1 - This is the simplest record-linkage algorithm we tested. It tries to apply an inverse transformation to anonymized query logs by applying a similar algorithm to the one used in the anonymization process (cf. Algorithm 1 in Section II). In short, it tries to recreate original logs by randomly matching users and queries from the same category. The attacker also takes advantage of both Restrictions 1 and 2 to achieve higher levels of de-anonymization.
• Record-linkage 2 - It improves the performance over Record-linkage 1. Instead of randomly matching users and queries, it assigns the user that appears the most times in a category to the selected query. Just like the other algorithms, both restrictions are respected.
• Record-linkage 3 - It keeps track of how many times a user issued a query in each category, constantly updating a simplified user profile. When the algorithm needs to assign a user to a query, the user with the most issued queries in that category will be chosen. If a user appears more than once, the result will be multiplied by the number of appearances of that user, balancing the importance between the current state of the system and historical values.
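The idea behind Record-linkage 2 can be illustrated with a simplified, batch-oriented Python sketch. It is an assumption-laden illustration rather than the evaluated attacker: the function name `link_most_frequent` is hypothetical, the input is one category's anonymized (user, query) pairs collected in a list instead of a stream, and ties are broken by the larger user identifier.

```python
from collections import Counter


def link_most_frequent(anonymized_bucket):
    """Guess an issuer for every query of one category.

    anonymized_bucket: list of (released_user, query) pairs of that category.
    """
    counts = Counter(user for user, _ in anonymized_bucket)
    guesses = []
    for released_user, query in anonymized_bucket:
        # By Restriction 1, the released user cannot be the original issuer,
        # so the attacker discards it before picking the most frequent user.
        candidates = [(n, user) for user, n in counts.items() if user != released_user]
        if not candidates:
            continue  # no admissible guess for this record
        guesses.append((max(candidates)[1], query))
    return guesses


bucket = [("a", "q1"), ("a", "q2"), ("b", "q3")]
print(link_most_frequent(bucket))  # [('b', 'q1'), ('b', 'q2'), ('a', 'q3')]
```

Record-linkage 3 would replace `counts` with a profile accumulated over the whole stream, so that historical frequencies also weigh in the choice.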

C. PRIVACY STUDY
Our privacy test compares the original query log data with the anonymized data. Results for this base case show that none of the original user/query pairs appear in the anonymized query log. Notwithstanding, that result does not guarantee full user privacy, since some attacks are possible over the output data flow, and some user logs may be re-identified. Three different record-linkage algorithms were applied to the anonymized query logs (cf. Algorithm 2). The resulting logs were compared to the original ones, counting the percentage of matching records. Our de-anonymization algorithms are based on Algorithm 2, which is similar to Algorithm 1 used in anonymization. It uses the stream of anonymized logs generated by the WSE as the main input. It also needs the k and $\ell$ parameters (explained in Section II). The smaller the difference between the k and $\ell$ values used in both algorithms, the better the results obtained from de-anonymization. In other words, the attacker will be able to re-identify the original data more easily.
The stream of anonymized logs is classified in the same way as the original one, since we assume that categorization is public and the attacker can use it. Therefore, the de-anonymization process uses the same categorization, which enables this algorithm to obtain the best de-anonymization rate when trying to recover the original logs.
The main difference with the anonymizer algorithm is the use of the record_linkage function, which is different for each implementation of the de-anonymizer algorithms. The most complex de-anonymizers also use additional data structures to improve de-anonymization performance. The differences between the algorithms are fully explained in Section IV-B2.
For analysis purposes, we need to evaluate the amount of memory and time used in each algorithm execution. Therefore, the previous algorithms were modified to calculate those values. An additional algorithm must be defined to find the number of logs that are identical when comparing two log streams.
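The comparison of two log streams mentioned above can be sketched as a multiset intersection. The function name `matching_percentage` is hypothetical, and the sketch assumes that each log is reduced to a hashable (user, query) pair and that both streams fit in memory.

```python
from collections import Counter


def matching_percentage(original, recovered):
    """Percentage of original (user, query) records that also appear in `recovered`."""
    if not original:
        return 0.0
    remaining = Counter(recovered)      # multiset of recovered records
    matches = 0
    for record in original:
        if remaining[record] > 0:       # each recovered record can match only once
            remaining[record] -= 1
            matches += 1
    return 100.0 * matches / len(original)


original = [("u1", "a"), ("u2", "b"), ("u3", "c"), ("u1", "d")]
recovered = [("u1", "a"), ("u3", "b"), ("u3", "c"), ("u2", "d")]
print(matching_percentage(original, recovered))  # 50.0
```

Counting with a multiset keeps the cost linear in the number of records, which matches the efficiency requirement stated in Section III.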
Figure 6 shows the percentage of matching records, executing the three algorithms with values of k between three and 200 and values of $\ell$ between one and 13. With $\ell$ = 1, only one level of the tree structure was used, which results in a data structure equivalent to the one used in our former paper [28]. $\ell$ = 13 is the maximum depth that our classifier was able to generate. Thus, there is no need to use higher values. We also picked the k values so as to be able to compare results between our current and former evaluation.
In all cases, results are under the theoretical maximum probability 1/k of being re-identified [32]. We ran the Kolmogorov-Smirnov goodness-of-fit statistical test [33], [34] to compare the k-anonymity probability with the experimental results (Figure 7). The maximum difference between the cumulative distributions, D, is 0.08, with a corresponding p-value of 0.9977. Therefore, the test does not reject the null hypothesis that our results follow k-anonymity's probability of re-identification (at the 5% level of significance).
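As an illustration of the statistic D used above, the following sketch computes the maximum difference between two empirical cumulative distributions. It is a generic two-sample computation under our own assumptions: the sample values are hypothetical, the p-value is not computed, and the exact test configuration used in the evaluation is not reproduced here.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions, so the supremum of their
    # difference is reached at one of the observed sample points.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical data: theoretical 1/k bounds versus made-up measured rates.
theoretical = [1 / k for k in (3, 10, 50, 100, 200)]
measured = [0.23, 0.08, 0.015, 0.009, 0.004]
print(ks_statistic(theoretical, measured))
```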
Each record-linkage version improves the re-identification rate, with the third version obtaining the best results overall. The k value was highly correlated with privacy: when the value of k increases, record linkage decreases. The tree depth also affects privacy. With a higher number of levels (a high depth value), users were matched with more specific queries; therefore, a correct re-identification of the original user was also more probable. Here, we face a trade-off between privacy and data utility. Results obtained this way are close to the ones obtained in our previous article using the proposed algorithm without restriction, since now the effective sizes of the category sets are closer to the k value specified as a parameter. However, on average a better anonymization is obtained, since the size of Q must be temporarily increased to meet restrictions 1 and 2.
Figure 8  algorithms, categories and variables used for anonymization. This ratio decreases quickly when the initial k value is increased, yielding a record linkage lower than 1% for k values greater than 90. In conclusion, the desired record-linkage level could be adjusted by modifying the k value, even offsetting the effect that variations of the tree depth have on the record linkage.

D. UTILITY STUDY
We proceed to analyze the utility of the proposed anonymizer. This analysis focuses on two different aspects:
• Percentage of logs that the system can generate as an output.
• Preservation of the original user's interests in the anonymized user's profiles.
First, we want to analyze the percentage of logs that can be generated by the system over the total number of logs that it receives. The proposed system uses sets, and each set must have at least k different users before being able to release an anonymized log. A possible drawback to this approach is that some sets do not reach k users and, therefore, the logs contained in these sets do not end up leaving the system. As we can see in Figure 9, this effect exists and it is directly proportional to the depth of the category tree. This is consistent, since with more depth, more categories are created and the minimum of k users in these categories is reached more slowly. However, we see that as more queries enter the system, all categories become filled with queries and the percentage of log output increases, tending to a 100% rate for any depth of the tree.
FIGURE 9. Output queries vs. total queries (%). Some sets take a while to fill. This effect is directly proportional to the depth of the category tree, as more sets need to get k different users.
Secondly, to measure the preservation of original user's interest in anonymized user's profiles, we will measure the distance between them, using a metric known as Earth Mover's Distance (EMD) [35].We calculate the distance between the categories of queries assigned to the original profile and the anonymized profile.As our classification of categories is stored in a tree graph, this distance is defined as the minimum length of the path that connects the categories assigned to the original and anonymized query.Once we have calculated the distance between individual queries, we add all the distances of that profile and, thus, we obtain the total distance between profiles.
Notice that if two queries are classified and anonymized with the same category, there is no distance between the two queries and there is no utility loss. This happens to all the queries when the depth of the tree is set to 13. However, other tree depths can lead to utility loss. For instance, in the example of Table 2, "piano" is classified as "Arts/Music" but the anonymizer is just using "Arts", since the depth is equal to 1. Queries classified as "Arts/Music" and "Arts/Painting" are mixed in "Arts" and assigned to different users. A third party could think that Alice is interested in "Painting" when she is just interested in "Music", i.e., there is a certain degree of utility loss. Since the third party still knows that Alice is interested in "Arts", we can see the previous case as an example of partial utility loss. Therefore, EMD represents the distance between the original user's interests and the ones that are deduced from the anonymized queries.
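The path-length distance described above can be sketched as follows. This is a minimal illustration, assuming categories are written as "/"-separated paths, that all top-level categories hang from an implicit common root, and that original and anonymized queries are paired one-to-one by position; the function names are hypothetical.

```python
def category_distance(cat_a, cat_b, sep="/"):
    """Number of tree edges on the shortest path between two categories."""
    a, b = cat_a.split(sep), cat_b.split(sep)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Climb from each category up to the deepest common ancestor.
    return (len(a) - common) + (len(b) - common)

def profile_distance(original_cats, anonymized_cats):
    """Sum of per-query distances between two equally long category lists."""
    return sum(category_distance(o, a) for o, a in zip(original_cats, anonymized_cats))

print(category_distance("Arts/Music", "Arts"))           # generalized to depth 1
print(category_distance("Arts/Music", "Arts/Painting"))  # sibling categories
print(profile_distance(["Arts/Music", "Arts/Music"], ["Arts/Music", "Arts/Painting"]))
```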
In Figure 10, we can see the average value of the EMD distances, as well as the maximum theoretical distance between profiles using the chosen categorization. This theoretical maximum distance is constant, regardless of which depth and k values we use. The real distance we get is not affected by k, but it decreases as the depth increases. This means that the more levels we use in our anonymizer, the closer the anonymized queries get to their original category and the better the data utility we obtain. In Figure 10, we can also see the loss of utility expressed as a percentage. Using this metric, it can be seen that with a depth of 1, the loss of utility is over 40% on average. With a depth of 6, the loss of utility is close to 0%, according to our definition of utility.

E. FUNCTIONAL STUDY
Next, we detail how the proposed functional requirements are met.

1) Modularity
To obtain a modular system, it has been designed as a set of micro-services. Because our proposal uses a micro-service architecture, it is easier to modify and adapt when applied to different environments. In addition, this design lets each service focus on a single process. By doing so, we achieve a system with low coupling and high cohesion. The anonymization service has been thoroughly explained. This service can be connected to other modules, such as categorization and profile creation.

2) Scalability
The proposed system can be scaled both vertically and horizontally. Vertical scalability is achieved by varying the amount of resources assigned to the system. These resources can be added in the form of either memory or CPU cycles. Horizontal scalability can also be achieved by activating or deactivating different instances in parallel. In addition, with the proposed anonymizer, the value of k could be dynamically adjusted, which also improves the scalability of the system by making it usable in a wider range of situations.

3) Speed
The speed of the anonymizer and the de-anonymizers was tested. All the results shown correspond to the time required to completely process a query using a single thread of execution on a single core. All the proposed algorithms can be used in parallel, achieving a better system throughput.
The fastest execution was achieved with k = 3 and a depth of 1, where on average a query was processed in 18.99 µs. Therefore, the system can handle up to 52659 queries per second, on average. Across all executions, covering every k and depth value we have tested, the average processing time per query was 33.68 µs, or 29691 queries per second. Compared to our previous proposal, where we obtained 22 µs per query, the system is slower on average, but offers greater data utility. However, depending on which parameter values are used, the system is faster than our previous proposal, as described below. The speed of the anonymizer is affected by k and the depth. In Figure 11, we can see that changes in the depth have little effect on the required time. In contrast, changes in the value of k have an important effect. For example, for k = 3 the system can process a log in about 18.99 µs. This value reaches 49.71 µs with k = 190. Taking into account that Google handles an average of 40000 queries per second (cf. Ref. [36] and citations thereof), a single thread of our algorithm could handle all real-time queries, using k values up to 50 with any depth, according to our test results.
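The throughput figures above follow directly from the mean processing times. The short sketch below reproduces that arithmetic and derives the per-query time budget implied by a load of 40000 queries per second; the helper names are ours.

```python
def queries_per_second(mean_time_us):
    """Single-thread throughput for a given mean processing time in microseconds."""
    return 1_000_000 / mean_time_us

def time_budget_us(target_qps):
    """Maximum mean time per query (µs) that still sustains target_qps on one thread."""
    return 1_000_000 / target_qps

for label, mean_time in [("k = 3, fastest run", 18.99),
                         ("mean over all runs", 33.68),
                         ("k = 190", 49.71)]:
    print(label, int(queries_per_second(mean_time)))

print("budget at 40000 queries/s:", time_budget_us(40000), "µs")
```

A single thread keeps up with the reference load only while its mean time per query stays within this budget; configurations above it require parallel instances.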
The same analysis has also been done with the proposed de-anonymization algorithms. Results can be seen in Figure 12.
The first de-anonymizer approach obtained results comparable to the anonymizer. This was expected, since in both cases the same base algorithm was used. The second and third de-anonymizers, which perform more complex operations, are also slower and more affected by increases in k-values. In all cases, we see that variations of the depth value are less important.

4) Delay
Another factor that we consider important to evaluate is the average delay of queries between entering the system and leaving it in the form of anonymized query logs. Figure 13 shows this delay as the mean number of other queries that enter the system during the period between the entry and the release of a given query. As we can see, this delay increases with the chosen depth value, but it ends up stabilizing. This is reasonable, since the system needs to fill categories initially and, once this happens, the output stabilizes.
Taking as reference the 40000 queries per second that Google receives (according to Ref. [36]), we see that our system's output stabilizes in a few minutes for larger depth values. Once the delay is stable, our system takes less than one second for depth values ≤ 6, and does not reach two seconds for larger depth values.
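Since the delay is measured in intervening queries, converting it to wall-clock time only requires the arrival rate. The following sketch shows the conversion; the delay values used in the example are hypothetical and are not read from Figure 13.

```python
def delay_seconds(intervening_queries, arrival_rate_qps=40000):
    """Wall-clock delay for a query that waits while `intervening_queries` arrive."""
    return intervening_queries / arrival_rate_qps

# Hypothetical stabilized delays, expressed as numbers of intervening queries.
print(delay_seconds(30000))  # below one second at the reference rate
print(delay_seconds(70000))  # between one and two seconds
```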

5) Resource Consumption
Notice that our algorithms do not use any disk space, therefore only memory consumption needs to be evaluated.
We have identified variations in the depth value as the main factor that affects resource consumption. Memory consumption increases when a new level of depth is added to the tree, in proportion to the number of effective categories that are added (cf. Table 3). Categories were created dynamically, depending on each query's classification; therefore, a different data set will generate different categories. At the end of our tests, we used a maximum of 194505 categories, in a tree with depth thirteen.
FIGURE 12. De-anonymizer mean time per query (µs). The first de-anonymizer obtained results comparable to the anonymizer. The second and third de-anonymizers, which are more complex, are also slower and more affected by increases in k-value.
FIGURE 13. Queries delay, as the mean number of other queries that enter the system during the period between the entry and the release of a given query. Once the categories are full, the output stabilizes.
With our test data, we see that most records are classified at depths between five and seven, although we found a maximum depth of thirteen. As we increase depth, there are fewer queries that can be classified at the last levels, using the same data and the same classifier. Although we increase the depth value, the effective number of categories created increases only marginally from this point. This also causes memory consumption to stabilize. Let us illustrate the previous observation with an example. Given a query classified as "a:b:c:d:e", if we use a depth equal to 4, the level 4 vertex "a:b:c:d" is used for anonymization. If we increase the depth to 5, or to a higher value, we use the complete category for anonymization, i.e., the level 5 vertex "a:b:c:d:e", even if we use a depth of 13.
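The truncation behavior in the example above can be written in a few lines. This is a sketch under the assumption that categories are stored as ":"-separated paths; the function names and the sample category list are hypothetical.

```python
def effective_category(category, depth, sep=":"):
    """Category vertex actually used for anonymization at a given tree depth."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Slicing past the end of the path simply returns the full category.
    return sep.join(category.split(sep)[:depth])

def effective_category_count(categories, depth):
    """Number of distinct vertices in use, which drives memory consumption."""
    return len({effective_category(c, depth) for c in categories})

print(effective_category("a:b:c:d:e", 4))
print(effective_category("a:b:c:d:e", 13))

sample = ["a:b:c:d:e", "a:b:c", "a:b:x", "a:y"]
print([effective_category_count(sample, d) for d in (1, 2, 3, 5, 13)])
```

Once the depth exceeds the length of the longest classified path, the number of distinct vertices no longer grows, which is why memory consumption levels off.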
On the other hand, we can see that k adds a multiplicative factor to the consumption of resources, depending on the number of existing effective categories. The results in Figure 14 show only the maximum memory consumption.
Regarding the different algorithms set forth, both the anonymizer and de-anonymizer 1 show the same memory consumption profile. De-anonymizer 3 is the algorithm with the highest memory consumption. This is because that algorithm creates user profiles in memory, so it is reasonable that it uses more resources. The anonymizer and de-anonymizers 1 and 2 should not use more memory than reported, regardless of the volume of logs they deal with. However, this is not the case for de-anonymizer 3, since memory consumption increases whenever it creates new user profiles.

6) Efficiency
As we have seen in the previous sections, a lightweight method has been defined. It allows the logs to be processed quickly with reduced resource consumption.
Studying the anonymizer, we see that both delay and memory consumption vary initially, because the system starts empty and the sets must be filled. As we have seen, once the sets reach k elements, these values stabilize. On the other hand, the processing speed of a log depends on the values of k and the depth, but it remains constant throughout each test set.
Analyzing the proposed algorithm, we can see that each log is only processed once. This allows us to equate its efficiency with that of a well-known singly-linked list traversal. Therefore, the time complexity of our proposal is linear in the size of the input, i.e., O(n).

FIGURE 14. The depth value is the main parameter that affects memory consumption. The value of k adds a multiplicative factor. Both the anonymizer and de-anonymizer 1 show the same memory profile. De-anonymizer 3 is the algorithm with the highest memory consumption, because it creates user profiles in memory.

7) Transparency
The input of the system should be a stream of classified query logs that can be obtained from the WSE. If only unclassified logs are available, a classification micro-service could be implemented and added to the WSE architecture, as we previously showed in Ref. [28]. If classified logs are available, they can be used without further modification. Our system generates an anonymized stream of logs, preserving the existing structure. From the point of view of an existing client, the generated output is completely indistinguishable from the original one. Therefore, total transparency is achieved.

V. RELATED WORK
Our work relates to the use of privacy-enhancing technologies (PETs) applied to the web search paradigm. Figure 15 shows a classification of PET proposals designed to protect users' privacy with respect to WSEs, on the basis of previous classifications [37]-[39]. The classification identifies two main actors: users and WSEs. The first group contains proposals that protect users' privacy at the WSE side, without the need for users' participation. They are asynchronous and transparent to the users. Our proposal falls under this first category. The second group includes approaches that protect users' privacy without any help from the WSE, i.e., approaches that do not require any changes at the server side of the WSE. The third group comprises approaches that require a certain level of cooperation between users and WSEs. The latter are not considered as server-side, since users actively participate in the process; when WSEs do not cooperate, it is assumed that users immediately detect it. In the sequel, we report related work under all three categories.

A. SURVEY ON SERVER SIDE PROPOSALS
WSEs aim at anonymizing data while minimizing information loss, for profit purposes. Our work is focused on this assumption. The goal is to commercialize releases of the protected set of query logs to third parties. Anonymization solutions to reach such a goal can be classified according to their anonymization inputs. Most solutions process either fixed-length (e.g., block-based) or data-stream inputs.

1) Fixed-length Inputs
In the case of fixed-length inputs, existing proposals consider a set of finite and static data structures. Each set contains all the elements to be anonymized. The protection of the whole dataset is conducted as a two-step process, first analyzing all the dataset elements, then processing them. Some representative solutions under this category are presented next.
Suppression - The anonymization of the dataset is conducted by eliminating those elements which, in isolation or in combination, may reveal sensitive information. The analysis of the dataset relies on either statistical or semantic methods to identify which elements require suppression.
Examples of suppression in the context of query log anonymization exist in the related literature [40]. The deletion of identifiers such as social security numbers, physical addresses, bank accounts or any other identification data related to the user are traditional examples of suppression in the literature [41]. Nevertheless, the AOL incident reveals the limitations of this approach [22]-[24]. The existence of quasi-identifiers in the AOL dataset, and the complexity of identifying their combinations, proved sufficient to re-identify AOL users via traditional log correlation techniques [21].
The suppression of infrequent queries is another approach [13]. It aims at suppressing those queries that are likely to contain identifying or quasi-identifying information. The approach requires the definition and enforcement of thresholds. Since queries may appear only a limited number of times [14], avoiding the elimination of a significant number of non-identifying queries becomes a complex and error-prone task. The approach can be complemented by selecting those queries resulting from clicking on common URLs, i.e., by establishing a correlation between clicking and quasi-identifiers [10]. Another possibility is the representation of query logs using graph theory [9]. Nodes are seen as user queries. A query is connected to other user queries whenever the intersection of their clicked URL sets is non-empty. The anonymization process is done by iteratively suppressing those queries that return fewer than k documents. Those queries that considerably contribute to the query graph (i.e., queries with partial or full target URLs) are considered vulnerable and suppressed.
Generalization - Another approach used to provide anonymity is based on the generalization of domain relationships, i.e., on analyzing the values that the associated attributes can assume. The concept of minimal generalization seeks to maintain the lowest possible distortion levels in the processed datasets [42]. Top-down approaches, using lexical and semantic databases to conduct general-purpose generalizations, have also been proposed [43], [44]. The idea is to transform groups of input queries into common conceptual abstractions (e.g., football and tennis as sports), in order to make users who performed similar queries indistinguishable. The main limitations associated with these approaches lie in the construction of generic dictionaries for the words or concepts to be anonymized. This may require, moreover, specific adaptations based on the language used in the original datasets.
k-anonymity - The property of k-anonymity [27] was proposed to minimize the risk of record linkage. A k-anonymized dataset has the property that each record is indistinguishable from at least k − 1 other records. This way, no individual can be re-identified with probability exceeding 1/k through linking attacks.
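The property can be checked mechanically on a small table. The sketch below is a generic illustration of the definition, assuming records are dictionaries and that the quasi-identifier columns are known; the column names and sample records are hypothetical.

```python
from collections import Counter

def anonymity_level(records, quasi_identifiers):
    """Largest k for which the records are k-anonymous over the given columns."""
    if not records:
        raise ValueError("at least one record is required")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # The smallest equivalence class limits the guarantee for the whole dataset.
    return min(groups.values())

def max_reidentification_probability(records, quasi_identifiers):
    """Upper bound 1/k on the success probability of a linking attack."""
    return 1 / anonymity_level(records, quasi_identifiers)

records = [
    {"zip": "430**", "age": "20-29", "query": "piano"},
    {"zip": "430**", "age": "20-29", "query": "guitar"},
    {"zip": "431**", "age": "30-39", "query": "flights"},
    {"zip": "431**", "age": "30-39", "query": "hotels"},
]
print(anonymity_level(records, ["zip", "age"]))
print(max_reidentification_probability(records, ["zip", "age"]))
```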
Current approaches propose methods of Statistical Disclosure Control (SDC) to transform query records into anonymous logs, while reducing the amount of query deletion [45], [46]. Logs of similar queries are used to group users, and their queries are later replaced by a prototype query. This makes them indistinguishable [47]-[51]. Users and queries are preserved, although queries are transformed to reduce the risk of disclosure. Similar approaches propose generating fake messages and mixing them with the legitimate ones [52], or masking infrequent queries using a more general frequent query [53], to achieve levels of privacy comparable to k-anonymity.
Differential Privacy - Initially described as a solution to manage the risk of identifying users participating in a given dataset [54], interactive scenarios of the same approach also exist [55]. The initial scenarios associated with differential privacy expect queries that access partial information of the dataset. However, when intelligently conducted, such queries may end up revealing information about the original users. For that reason, interactive improvements are expected to evaluate how far queries go, and to deny a response whenever a limit is exceeded. Since the protected outputs may still preserve some statistics (e.g., query suggestions and spelling corrections), extended proposals aim at further limiting the risk of information disclosure in such returned statistics [10].
The authors of [56] propose a technique in which samples with high utility are selected to become the representative records in each cluster, i.e., to achieve the objective of leaking less private information while releasing more useful information. Other proposals [57], [58] add Laplacian noise to the logs to preserve privacy. However, the more noise is added, the more data utility is reduced.
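The noise/utility trade-off can be made concrete with the textbook Laplace mechanism applied to a count. This sketch is not taken from [57], [58]; the calibration scale = sensitivity / epsilon is the standard one, and the function names and parameters are our assumptions.

```python
import math
import random

def laplace_from_uniform(u, scale):
    """Inverse-CDF transform: maps u in (-0.5, 0.5) to a Laplace(0, scale) sample."""
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Release a count perturbed with Laplace noise of scale sensitivity / epsilon."""
    u = rng.random() - 0.5
    while u == -0.5:  # avoid log(0) at the boundary of the interval
        u = rng.random() - 0.5
    return true_count + laplace_from_uniform(u, sensitivity / epsilon)

rng = random.Random(7)
for epsilon in (0.1, 1.0, 10.0):
    # A smaller epsilon means stronger privacy, a larger scale, and noisier output.
    print(epsilon, [round(noisy_count(100, epsilon, rng=rng), 2) for _ in range(3)])
```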

2) Data-stream Inputs
This approach allows data to be processed partially: the system does not need all the data before it starts processing. It is therefore able to generate data outputs with a minimum delay [59]. In addition, it opens the door to dealing with very large datasets, even infinite ones. Still, protecting the privacy of very large data streams continues to present some difficulties [60]. Next, we survey some representative solutions under this category.
Rank Swapping - The method was first described for numerical variables [61], although initial ideas associated with swapping data exist in earlier work [62]. We can also find other approximations [63], [64]. In all such cases, the proposals only consider structured data. This is because the data is sorted by the value of an attribute, and each value is then exchanged with a randomly selected one among the nearest values in the rank [65].
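A simplified version of rank swapping for one numerical attribute is sketched below. It assumes a fixed window of neighboring ranks and swaps each value at most once; the function name, the window parameter, and the sample values are illustrative choices and do not reproduce the exact procedures of [61]-[65].

```python
import random

def rank_swap(values, window, rng=None):
    """Swap each value with a random partner at most `window` ranks away."""
    rng = rng or random.Random()
    order = sorted(range(len(values)), key=lambda i: values[i])  # indices by rank
    result = list(values)
    swapped = [False] * len(values)
    for rank, idx in enumerate(order):
        if swapped[rank]:
            continue
        last = min(rank + window, len(order) - 1)
        candidates = [r for r in range(rank + 1, last + 1) if not swapped[r]]
        if not candidates:
            continue  # no free neighbor left: the value stays in place
        partner = rng.choice(candidates)
        other = order[partner]
        result[idx], result[other] = values[other], values[idx]
        swapped[rank] = swapped[partner] = True
    return result

ages = [34, 21, 58, 45, 29, 62, 40]
print(rank_swap(ages, window=2, rng=random.Random(0)))
```

The procedure relies on a total order over attribute values, which is why it applies to structured data but not directly to free-text queries.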
Differential Privacy - The differential privacy approach can also be applied to anonymize data streams [66]. In this case, the original query is not released; instead, a synthetic one is obtained using semantic similarity. The lack of structure in query logs, combined with new terms that may not be present in the semantic database, could represent a challenge for this approach. Another limitation of using differential privacy in a streaming environment is the need to maintain a fixed privacy level. It is possible that, at some point, no more data can be published if the privacy of users is to be preserved.
Probabilistic k-anonymity - The concept of probabilistic k-anonymity relaxes the indistinguishability requirement of k-anonymity [67]. It only requires that the probability of re-identification is the same as in the case of k-anonymity. By relaxing the indistinguishability requirement, a better use of the data may be accomplished. Moreover, logs can be released containing the original queries. On the negative side, given the continuous generalization of unstructured dataset elements, a certain imprecision is added to the generated profiles. The limitations of existing proposals in the related literature [28], [68] lie in their classification methods, which are very basic. Hence, the number of resulting categories is low, leading to higher degrees of data utility loss.

B. SURVEY ON CLIENT SIDE PROPOSALS
One may argue that WSEs have no motivation to protect the privacy of users. Indeed, users may be seen as the only party interested in, and responsible for, protecting data privacy. Under this assumption, we find some protection approaches which do not expect any collaboration between WSEs and users. Such approaches can be classified into two main categories: i) obfuscation techniques and ii) anonymous channels. Obfuscation techniques generate noise to distort the user profiles managed by the WSEs. Anonymous channels assume an infrastructure between users and WSEs to handle the profiling of activities. The use of client-side techniques is assumed to generate non-realistic profiles that may have an adverse effect on the services provided by WSEs.

1) Obfuscation Techniques
Early techniques assume the introduction of random queries (e.g., fake queries) in order to obscure users' profiles. Random queries must be indistinguishable from the real queries. This property is known as unobservability. Representative solutions based on obfuscation techniques can be classified according to the number of users that participate in the protocol. We have standalone solutions and distributed solutions. Standalone solutions assume individual users handling their own privacy with respect to the WSEs. Distributed solutions assume groups of users working together to protect the privacy of each user. Next, we provide some examples for each category.
Standalone Systems - These schemes generate synthetic queries that are used to hide the real queries of the users [69]-[77]. Synthetic queries are submitted together with the real queries, obfuscating the profiles that the WSE holds for each user. If the synthetic queries are in some way semantically related to the user's queries, the obfuscated profile will still be usable, i.e., the WSE will be able to personalize the user's results. When the synthetic queries are semantically unrelated to the user's queries, the profile will be heterogeneous and the personalization will be less accurate. This does not mean that one alternative is better than the other, since users may have different preferences regarding the trade-off between privacy and utility. Some works show that it is possible to distinguish real queries from synthetic queries [73], [78]-[80]. These works rely on the idea that machine-generated queries do not have the same features as human-generated queries.
Distributed Systems - These schemes require the collaboration of a group of users who work in partnership to protect their privacy, i.e., they hide their actions within the actions of many others [81]-[87]. Typically, these schemes put users into a large group where they submit requests on behalf of other members. Users exchange their queries. Personalization is only possible if the members of the group share the same interests [37]. In some proposals [81]-[83], there is a central node that poses a bottleneck for the overall system performance. In other cases, one type of path [81], [84]-[87] is created to submit the query, or a group of users must be created [81]-[83]. In both cases, a significant delay is introduced [37].

2) Anonymous Channels
The proposals under this category use anonymous infrastructures [88], [89] in order to send users' queries to the WSE. By concealing the identity of the users associated with the queries, WSEs are assumed to be unable to profile users. However, this may affect the quality of the service offered by the WSEs to the users.
Chaum's mix networks [90] are representative cases of solutions under the category of anonymous channels. Messages pass through several nodes. Each node disassociates the input messages from the output messages by means of cryptography [88], [89]. Evolved techniques assume the use of proxies [91], relaying connections (e.g., queries) from users to the recipients (e.g., the WSEs). The key concept is that the proxy delivers the messages but does not disclose the source (e.g., the user's identity). DuckDuckGo, Start Page and Yippy are some significant examples using proxy-like infrastructures. By using these solutions, users transfer their trust from WSEs to the proxies (i.e., users must assume that proxies do not monitor or log their traffic).
Web MIXes [92] provides anonymous and unobservable real-time Internet access. It incorporates an authentication mechanism in order to prevent flood attacks. Additionally, it includes a feedback system with an interface that informs users about their current level of protection. However, some flaws in its authentication process may allow external attackers to perform replay attacks [93]. The synchronous nature of Web MIXes may also lead to problems when dealing with asynchronous TCP/IP networks [94].
The use of onion routing [95] to establish anonymous channels in the context of queries and WSEs has also been proposed in the literature [96]. General-purpose plugins, and modified web browsers using the Tor Project [97], are user-friendly solutions based on the onion routing paradigm. Nonetheless, several weaknesses have been reported [98]. Tor does not attempt to offer security against passive global adversaries [89]. Similarly, the Invisible Internet Project (I2P) [99] builds an anonymous network layer designed to be used for anonymous communication.

C. SURVEY ON COLLABORATIVE WSE-CLIENT PROPOSALS
Solutions under this category assume that users and WSEs work together in order to protect users' privacy. Next, we report solutions under this category in three main groups: i) Private Information Retrieval; ii) Platform for Privacy Preferences (P3P); and iii) Context-based Retrieval.

1) Private Information Retrieval
Private Information Retrieval (PIR) schemes [100]-[103] enable users to obtain information from a database privately, i.e., the server cannot know what information was retrieved. Through a PIR scheme, users can search the documents stored in the database and recover those of interest to them. The problem of submitting a query to a WSE while preserving the user's privacy is equivalent to the PIR problem. However, PIR schemes suffer from two practical problems that make them unsuitable for WSEs [82]: PIR schemes do not scale to large databases, and users are assumed to know the precise location of the records to be recovered.
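Both practical problems are visible in a toy example. The sketch below implements a classic two-server, information-theoretic PIR construction for a database of bits, assuming two replicas that do not collude; it is meant only as an illustration and is not necessarily one of the schemes in [100]-[103]. All function names are ours.

```python
import secrets

def server_answer(database_bits, subset):
    """Each server returns the XOR of the bits at the requested positions."""
    answer = 0
    for i in subset:
        answer ^= database_bits[i]
    return answer

def make_queries(n, index):
    """Random subset for server 1; the same subset with `index` toggled for server 2."""
    s1 = {i for i in range(n) if secrets.randbits(1)}
    s2 = s1 ^ {index}  # symmetric difference flips the membership of index
    return s1, s2

def retrieve(database_bits, index):
    """The client must already know the position `index` of the wanted record."""
    s1, s2 = make_queries(len(database_bits), index)
    # All positions other than `index` appear in both answers and cancel out.
    return server_answer(database_bits, s1) ^ server_answer(database_bits, s2)

db = [1, 0, 1, 1, 0, 0, 1, 0]
print([retrieve(db, i) for i in range(len(db))])
```

Each subset taken alone is uniformly random, so neither server learns the index; however, every server touches on average half of the database per query, and the client needs the exact record position beforehand.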

2) Platform for Privacy Preferences (P3P)
The Platform for Privacy Preferences (P3P) [104], [105] was created by the World Wide Web Consortium (W3C) with the objective of making it easier for users to obtain information about the privacy policies of the sites that they visit. P3P is a framework through which users can automate the protection of their privacy. They can define their privacy preferences and, when a website does not conform to these preferences, P3P-enabled browsers may alert the user and even take pre-established actions (e.g., deny access to cookies). The Do-Not-Track initiative [106] is a policy-based P3P system in which HTTP headers request web applications not to track users. The web application must be P3P-compliant for it to be effective. It has been studied in several works [107]-[109] and standardized by W3C. However, it is nowadays considered an obsolete protocol. In fact, P3P-like solutions have been criticized due to the impact that governmental laws may have on users [110], the lack of follow-up from websites with respect to privacy-protection mandates in their legal jurisdictions (e.g., difficulties of websites in enforcing their own privacy policies) [111], and the low number of potential adopters [112].

3) Context-based Retrieval
Context-based retrieval proposals aim at storing user profiles (e.g., search history) on the client's machine. This information makes it possible to obtain users' interests and re-rank search results according to them. The WSE and the users participate together in the searching process in order to obtain the final results, i.e., the WSE receives the query and returns the results. Then, these results are re-ranked at the client side. The User-Centered Adaptive Information Retrieval (UCAIR) project [113] collects and exploits available user context from submitted queries and clicked results. Similar schemes allow users to choose the content and degree of detail of their profiles exposed to the WSE [114]-[116]. In the end, users determine the profile content that is revealed to the WSE when a query is submitted. The parameters associated with the stored profiles can be adjusted in order to improve the quality of the results. Potential disadvantages of these proposals relate to performance and effectiveness limitations of results ranked at the client (i.e., much less effective than ranking the results at the server side) [113]. Moreover, it is expected that WSEs can still profile users after several executions of the approach.
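The client-side re-ranking step can be illustrated with a small sketch. It assumes that every result returned by the WSE carries a category label and that the local profile maps categories to interest weights; both assumptions, as well as the names and sample data, are hypothetical and do not describe UCAIR or the schemes in [114]-[116].

```python
def rerank(results, profile):
    """Re-order (url, category) results by the locally stored interest weights."""
    ranked = sorted(
        enumerate(results),
        # Higher interest first; ties keep the order returned by the server.
        key=lambda item: (-profile.get(item[1][1], 0.0), item[0]),
    )
    return [result for _, result in ranked]

# Results in server order, and a profile that never leaves the client machine.
results = [
    ("a.example/painting", "Arts/Painting"),
    ("b.example/piano", "Arts/Music"),
    ("c.example/scores", "Sports"),
]
profile = {"Arts/Music": 0.8, "Sports": 0.1}
print([url for url, _ in rerank(results, profile)])
```

Because the client can only re-order what the server returned, relevant documents outside the returned list remain unreachable, which is one reason for the limited effectiveness noted above.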

VI. CONCLUSION
A formal approach for the anonymization of WSE query logs has been presented. Our proposal allows query logs to be published without any modification other than eliminating direct identifiers and equivalent user re-assignment categories. This contrasts with existing approaches, which release heavily modified data, either distorted or generalized, to maintain anonymity. In addition, our proposal allows some degree of configuration, using two main parameters:
• k to adjust the level of diversity in each category.
• to adjust the number of available categories.
This parameterization allows the privacy and utility levels of the generated logs to be adjusted to the needs of each application.
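The role of k can be illustrated with a minimal Python sketch. It assumes, as the caption of Figure 9 suggests, that the queries classified in a category are held back until the set for that category contains k different users. It is not a transcription of Algorithm 1: the class name, the flush policy, and the shape of the released records are hypothetical.

```python
from collections import defaultdict


class CategoryBuffer:
    """Hold queries per category until k distinct users have contributed."""

    def __init__(self, k):
        self.k = k
        self.pending = defaultdict(list)  # category -> [(user, query), ...]

    def add(self, user, query, category):
        """Buffer one query log and return the released records, if any."""
        bucket = self.pending[category]
        bucket.append((user, query))
        if len({u for u, _ in bucket}) < self.k:
            return []  # fewer than k different users so far: keep waiting
        del self.pending[category]  # the set is complete: flush it
        # Assumption: direct identifiers are dropped on release, so only
        # the category and the query text leave the buffer.
        return [(category, q) for _, q in bucket]


# Hypothetical stream with k = 2.
buf = CategoryBuffer(k=2)
out1 = buf.add("u1", "cheap flights", "Travel")
out2 = buf.add("u1", "hotel rome", "Travel")
out3 = buf.add("u2", "train paris", "Travel")
```

Here the first two calls release nothing because only one user has contributed to the category; the third call completes the set and releases all three buffered queries at once.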
Three algorithms have been evaluated by attacking the anonymized data under the most favorable scenario for the attacker, i.e., when the attacker knows the algorithms used by the WSE, all the parameters and the data. The attacker has access to the anonymized log stream, but not to the original logs. Tests with this context and several values of k and were conducted.
Our best record-linkage attempt re-identified 23.18% of the original logs with the lowest k-value, the highest -value and the most complex record-linkage algorithm, which is also the one that needs the most resources. With the same parameters, the simplest record-linkage algorithm re-identifies 18.36%. These results drop rapidly: less than 1% of the original logs are recovered for values of k over 100.
Variations in the values of do not have a noticeable impact in terms of record linkage, but they do offer a significant improvement in terms of data utility.
Our proposed ideas were tested using the released AOL logs, showing the feasibility of our solution in real environments. The application of our work is sufficient to generate anonymized logs that meet representative criteria, e.g., release of anonymized data to third parties. Our solution can handle the equivalent of Google's average load, using only one execution thread per testing environment. To evaluate the utility of the logs after anonymization, we have measured distances between user profiles using the Earth Mover's Distance. We have found that, using an -value of one, 42.03% of utility was lost. Using -values of six or more, less than 1% of utility was lost.
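For readers unfamiliar with the Earth Mover's Distance, the sketch below computes it in the simplest setting: two profiles given as histograms over the same ordered bins, with unit distance between adjacent bins. In that case the EMD equals the sum of absolute differences between the two cumulative distributions. This is a simplifying assumption; the ground distance actually used over the category hierarchy is not reproduced here, and the function name and sample profiles are hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two histograms over the same ordered bins (unit spacing)."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("profiles must have positive total mass")
    # Normalize both profiles so that each carries a total mass of one.
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    # In one dimension, the cost of moving mass is the area between the CDFs.
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical per-category query counts for one user, before and after
# anonymization.
original = [6, 3, 1]
anonymized = [5, 3, 2]
distance = emd_1d(original, anonymized)
```

A distance of zero means that the anonymized profile has the same distribution as the original one; larger values indicate that more mass had to be moved, i.e., that more utility was lost.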
There are several avenues for improving our work. Additional categorizers may be proposed, for example using artificial intelligence systems to perform query analysis. Another improvement is to consider dynamic -values, either globally or for specific category branches. System performance could also be tested in a distributed environment, where each node is responsible for processing a part of the queries. A real-time record-linkage analysis could be added to ensure that we only publish records that meet a certain privacy threshold. Finally, some experiments could be conducted with queries' time, both with anonymization and de-anonymization algorithms, to improve their performance.

FIGURE 4. Full Architecture: the WSE Anonymizer takes a stream of query logs and anonymizes them, also generating a database of user profiles. It implements Algorithm 1 (cf. Section II). The De-anonymizer implements Algorithm 2 and simulates adversarial actions over the anonymized logs. It tries to recreate the original logs and user profiles. The Profile matcher, responsible for benchmarking anonymization, de-anonymization and performance, also generates a profile utility metric.

FIGURE 5. AOL log format. Each row represents a query log. Columns contain, from left to right: user identifier, query submitted, time submitted, result selected and result URL.

FIGURE 6. Record linkage (%). Percentage of matching records, executing the three de-anonymization algorithms with values of k between three and 200 and values of between one and 13.

FIGURE 9. Output queries vs. total queries (%). Some sets take a while to fill. This effect is directly proportional to the depth of the category tree, as more sets need to get k different users.

FIGURE 10. (a) Maximum theoretical distance between profiles, constant, and average EMD distances, inversely proportional to . (b) Loss of utility (%): the more levels we use in our anonymizer, the better the data utility.

FIGURE 11. Anonymizer mean time per query (µs). -value has little effect on the required time; k-value has a greater effect.

FIGURE 13. Queries delay, as the mean number of other queries that enter the system during the period between the entry and the exit of a given query. Once the categories are full, the output stabilizes.

FIGURE 15.Classification of Web Search Privacy Enhancing Technologies (PETs).

TABLE 1. Notations used in this paper.
R: Stream of query logs
r_j: Individual query log
u_i: User unique identifier
q_j: Individual query text
cq: Full individual query classification
τ: Hierarchical ontology of categories
1 s , ..., γ s * } ∈ c;

TABLE 2. Applying Algorithm 1 with k=2 and =1

VOLUME 4, 2016

FIGURE 1. Contents of R, τ and R' in the example provided in Table 2.

FIGURE 2. Contents of τ and R* when trying to deanonymize R from the example provided in Table 2.
shows mean final |Q| values, related to and initial k value. For low values, mean final |Q| values are higher because they have fewer category results and more user coincidences in the same category. However, with small k and values, the high number of queries that pass through each category counters this effect. With higher values, final |Q| values tend to match the specified k. The highest record linkage is obtained with highest and lowest k values. Our best de-anonymizer algorithm was able to link 23.18% of records to the original user. Deanonymization tests were conducted knowing exactly all Final |Q|-value, as the mean size of queries' sets. For low values, final |Q| is higher due to more user coincidences in the same category. With higher values, final |Q| tends to match the specified k.

TABLE 3. Number of categories added with each increase in -value and total categories of a tree with depth. Although we found a maximum depth of thirteen, most records are classified at depths between five and seven.