A Survey of Approximate Quantile Computation on Large-scale Data (Technical Report)

As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.


I. INTRODUCTION
Data profiling is a set of activities to describe the metadata about given data [1], [2]. It is crucial for data analysis, especially for large-scale data. It helps researchers to understand data distribution [3], discover duplicates [4], [5], detect anomalies [6], determine thresholds [7], [8], etc. Such information provides guidance for other data preprocessing work such as data cleaning [9], [10], which can subsequently improve the performance of data mining dramatically [11]. When preprocessing large-scale data, data profiling is attached great importance to and faces its own challenges. Because of large data size, classic brutal methods are not applicable any more for their intolerable complexity of both time and space. Researchers have spent decades on figuring out new ways to compute the metadata which can be calculated easily on small data. The metadata can be divided into two categories based on scalability: aggregation statistics and order statistics [12].
Aggregation statistics are named for their property that they are mergeable and incremental, which makes them relatively easy to be computed no matter how large the data is. For examples, sum, mean values, standard deviations, min or max values are all aggregation statistics. For streaming models [13], [14], where data elements come one by one with time, we can trace and update aggregated results covering all arrived data by incrementing new results continuously. Time complexity and space complexity are both O (1). As for distributed models [15], where data are stored in nodes of a distributed network, the overall aggregation statistics can be obtained by merging results from each node. The total communication cost of this computation is O(|v|), where |v| is the number of network nodes. However, order statistics, such as quantiles, heavy hitters, etc., do not preserve such property. So, we cannot compute them by merging existing results with newly produced results in a straight way. In order to compute them, many customized data structures or storage structures are proposed for these order statistics, trying to turn them into a mergeable or incremental form in some way.
In this summary, we focus on one order statistic, quantiles. They help to generate the description of the data distributions without parameters. In other words, they are able to reflect the cumulative distribution function (cdf), thus the probability distribution function (pdf), of data at low computational cost. Pdf is widely used in data cleaning and data querying. For example, in data cleaning, it is applied to demonstrate the distance distribution among values of the same attribute so as to identify misplaced attribute values [16]. And in data querying, it helps to set an appropriate correlation filter, improving efficiency for set correlation query over set records in databases [17]. Therefore, quantiles are regarded as one of the most fundamental and most important statistics in data quality analysis in both theory and practice. For instance, many data analysis tools, including Excel, MATLAB, Python, etc., have quantile-computing functions as built-in components or libraries. In the Sawzall language, which is the basic for all Googles log data analysis, quantile is one of the seven basic statistic operators defined, along with sum, max, top-k, etc. [18]. Besides, quantiles are widely used in data collection and running-state monitoring in sensor networks [19], [20]. When a dataset contains dirty values, compared with mean values and standard deviations, quantiles and median absolute deviations are more objective and more accurate to reflect data center and data deviation [21]. They are less sensitive to outliers. In temporal data, where imprecise timestamps are prevalent, even if some timestamps are delayed very long or have inconsistent granularity, quantiles are still able to specify appropriate temporal constraints on time interval, helping to clean the data [22]. In addition, quantile algorithms have been widely used as subroutines to resolve more complicated problems, such as equi-depth histograms [23] and dynamic geometric computations [24].
A quantile is the element at a certain rank in the dataset after sort. Algorithmic studies can be traced back to 1973 at least when linear-time selection was invented [25]. In classic methods of computing φ-quantile over a dataset of size N , where φ ∈ (0, 1), first we sort all elements and then return the one ranking ⌊φN ⌋. Its time complexity is O(N log N ) and space complexity is O(N ) obviously. However, in large-scale data, the method is infeasible under restrictions of memory size. Munro et al. has proved that any exact quantile algorithm with p-pass scan over data requires at least Ω(N 1/p ) space [26]. Besides, in streaming models, e.g., over streaming events [27], quantile algorithms should also be streaming. That is, they are permitted to scan each element only once and need to update quantile answers instantaneously when receiving new elements. There is no way to compute quantiles exactly under such condition. Thus, approximation is introduced in quantile computation. Approximate computation is an efficient way to analyze large-scale data under restricted resources [28]. On one hand, it raises computational efficiency and lower computational space. On the other hand, large scale of the dataset can dilute approximation effects. Large-scale data is usually dirty, which also makes approximate quantile endurable and applicable in industry. Significantly, the scale of data is relative, based on the availability of time and space. So, the rule about how to choose between exact quantiles and approximate quantiles differs in heterogeneous scenarios, depending on the requirement for accuracy and the contradiction between the scale of data and that of resources. When the cost of computing exact quantiles is intolerable and the results are not required to be totally precise, approximate quantiles are a promising alternative.
We denote approximation error by ǫ. A ǫ-approximate φquantile is any element whose rank is between r − ǫN and r + ǫN after sort, where r = ⌊φN ⌋. For example, we want to calculate 0.1-approximate 0.3-quantile of the dataset 11,21,24,61,81,39,89,56,12,51. As shown in Figure 1, we sort the elements as 11,12,21,24,39,51,56,61,81,89 and compute the range of the quantile's rank, which is [(0.3 − 0.1) × 10, (0.3 + 0.1) × 10] = [2,4]. Thus the answer can be one of 12, 21, 24. In order to further reduce computation space, approximate quantile computation is often combined with randomized sampling, making the deterministic computation becomes randomized. In such case, another parameter δ, or randomization degree, is introduced, meaning the algorithm answers a correct quantile with a probability of at least 1 − δ. There are 3 basic metrics to assess an approximate quantile algorithm [29]: • Space complexity It is necessary for streaming algorithms. Due to the limitation of memory, only algorithms using sublinear space are applicable [30]. It corresponds to communication cost in distributed models. In a distributed sensor network, communication overhead consumes more power and limits the battery life of power-constrained devices, such as wireless sensor nodes [31]. So, quantile algorithms are aimed at low space complexity or low communication cost.
• Update time It is the time spent on updating quantile answers when new element arrives. Fast updates can improve the user experience, so many streaming algorithms take update time as a main consideration. • Accuracy It measures the distance between approximate quantiles and ground truth. Intuitively, the more accurate an algorithm is, the more space and the longer time it will consume. They are on the trade-off relationship. We use approximation error, maximum actual approximation error and average actual approximation error as quantitative indicators to measure accuracy. We collected and studied researches about approximate quantile computation, then completed this survey. The survey includes various algorithms, varying from data sampling to data structure transformation, and techniques for optimization. Important algorithms are listed in Table I. The remaining parts of this survey are organized as follows: In Section II, we introduce deterministic quantile algorithms over both streaming models and distributed models. Section III discusses randomized algorithms [32]. Section IV introduces some techniques and algorithms for improving the performance and efficiency of quantile algorithms in various scenarios. Section V presents a few off-the-shelf tools in industry for quantile computation. Finally, Section VI makes the conclusion and proposes interesting directions for future research.
II. DETERMINISTIC ALGORITHMS An algorithm is deterministic while it returns a fixed answer given the same dataset and query condition. Furthermore, quantile algorithms are classified based on their application scenarios. In streaming models, where data elements arrive one by one in a streaming way, algorithms are required to answer quantile queries with only one-pass scan, given the data size N [33] or not [34], [36], [37], [47], [48]. Except to answering quantile queries for all arrived data, Lin et al. [35] concentrates on tracing quantiles for the most recent N elements over a data stream. In distributed models, where data or statistics are stored in distributed architectures such as sensor networks, algorithms are proposed to merge quantile results from child nodes using as low communication cost as possible to reduce energy consumption and prolong equipment life [20], [38], [40], [42], [45].

A. Streaming Model
The most prominent feature of streaming algorithms is that all data are required to be scanned only once. Besides, the length of the data stream may be uncertain and can even grow arbitrarily large. Thus, classic quantile algorithms are infeasible for streaming models because of the limitation of memory. Therefore, the priority of approximate quantile algorithms for streaming models is to minimize space complexities. In 2010, Hung et al. [49] have proved that any comparison-based ǫapproximate quantile algorithm over streaming models needs space complexity of at least Ω( 1 ǫ log 1 ǫ ), which sets a lower bound for these algorithms.
Both Jain et al. [50] and Agrawal et al. [51] proposed algorithms to compute quantiles with one-pass scan. However, neither of them clarified the upper or lower bound of  [33] Deterministic Streaming O( 1 ǫ log 2 (ǫN )) GK01 [34] Deterministic Streaming O( 1 ǫ log(ǫN )) Lin SW [35] Deterministic approximation error. In 1997, Alsabti et al. [47] improved the algorithm and came up with a version with guaranteed error bound, referred to as ARS97. Its basic idea is sampling and it includes the following steps: 1) Divide the dataset into r partitions.
2) For each partition, sample s elements and store them in a sorted way. 3) Combine r partitions of data, generating one sequence for querying quantiles. ARS97 is targeted at disk-resident data, rather than streaming models. Nevertheless, its idea to partition the entire dataset and maintain a sampled sorted sequence for quantile querying inspires quantile algorithms over streaming models afterwards.
The inspired algorithm is referred to as MRL98 proposed by Manku et al. [33]. It requires the prior knowledge of the length N of data stream. Similar with ARS97, MRL98 divides the data stream into b blocks, samples k elements from each block and puts them into b buffers. Each buffer X is given a weight w(X), representing the number of elements covered by this buffer. The algorithm consists of 3 operations: • NEW Put the first bk elements into buffers successively and set their weights to 1. • COLLAPSE Compress elements from multiple buffers into one buffer. Specifically, each element from an input buffer X i would be duplicated w(X i ) times. Then these duplicated elements are sorted and merged into a sequence, where k elements are selected at regular intervals and stored in the output buffer Y , whose weight w(Y ) = i w(X i ) • OUTPUT Select an element as the quantile answer from b buffers.
NEW and OUTPUT are straightforward, so the algorithm's space complexity depends mainly on how to trigger COL-LAPSE. MRL98 proposed a tree-structure trigger strategy. Each buffer X is assigned a height l(X) and l is set to min i l(X i ). l(X) is set as the following standard: • If only one buffer is empty, its height is set to l.
• If there are two or more empty buffers, their heights are set to 0. • Otherwise, buffers of height l are collapsed, generating a buffer of height l + 1. By tuning b and k, MLR98 can narrow the approximation error within ǫ. Figure 2 demonstrates the trigger strategy when b = 3. The height of the strategy tree is logarithmic, thus the space complexity is O( 1 ǫ log 2 (ǫN )). The limitation of MRL98 is that it needs to know the length of the data stream at first. However, in more cases, the length is uncertain and may even grow arbitrarily large. In such case, Greenwald et al. [34] proposes celebrated GK01 algorithm, distinguished by its innovative data structure, Summaries, or S for short. The basic idea is that when N increases, the set of ǫ-approximate answers for querying φ-quantile expands as well, so correctness can be retained even if removing some elements. S is a collection of tuples in the form of where v i is the element with the property v i ≤ v i+1 , g i and ∆ i are 2 integers satisfying following conditions: r(v i ), whose bounds are guaranteed by (1), is the ground truth of v i 's rank. Obviously, there are at most g i + ∆ i − 1 elements between v i−1 and v i . (2) makes sure that the range is within ⌊2ǫN ⌋ − 1. Therefore, for any φ ∈ (0, 1), there always exists a tuple and thus v i is a ǫ-approximation φ-quantile. To find the approximate quantile, we can find the least i satisfying: and return v i−1 as the answer. S also supports several operations: (2), t i is removable only when it satisfies • COMPRESS It can be found that DELETE is adding g of the deleting tuple to that of its predecessor. So we can delete multiple successive tuples, t i+1 , t i+2 , ..., t i+k at the same time by updating g i as g i = g i + g i+1 + g i+2 + · · · + g i+k and removing them. GK01 proposed a complicated COMPRESS strategy to reduce the size of S as small as possible: it executes when t i , t i+1 , ..., t i+k are removable on the arrival of every 1 2ǫ elements. It proves that the maximum size of S is 11 2ǫ log(2ǫN ). So, its space complexity is O( 1 ǫ log(ǫN )). Additionally, the summary structure is also applicable in a wide range such as Kolmogorov-Smirnov statistical tests [52] and balanced parallel computations [53].
GK01 is for computing quantiles over all arrived data. Sometimes quantiles of the most recent N elements in a stream are required. Lin et al. [35] expanded GK01 and proposed two algorithms for such case: SW model and n-of-N model. SW model is for answering quantiles over the most recent N elements instantaneously, where N is predefined. The model puts the most recent N elements into several buckets in their arriving order. Rather than original elements, each buckets stores a Summary [34] covering ǫN 2 successive elements. The buckets have 3 states as illustrated in Figure 3: • A bucket is active when its coverage is less than ǫN 2 . At this time, it maintains a ǫ 4 -approximate S computed by GK01. • A bucket is compressed when its coverage reaches ǫN 2 . The ǫ 4 -approximate S would be compressed to ǫ 2approximate S by an algorithm COMPRESS. • A bucket is expired if it is the oldest bucket when the coverage of all buckets exceeds N . Once expired, the bucket is removed from the bucket list. By maintaining and merging buckets, SW model answers quantile queries of elements covered by all unexpired buckets. However, when an old bucket is just expired and the active bucket has not been full, the coverage of all unexpired buckets N ′ is less than N . The difference between N and N ′ is at most ⌊ ǫN 2 ⌋ − 1. An algorithm LIFT was proposed to resolve . The distinction of n-of-N model is that it answers quantile queries instantaneously over the most recent n elements where n is any integer not larger than a predefined N . It takes advantage of the EH-partition technique [54] as shown in Figure 4. The technique classifies buckets, marking buckets at level i as i−bucket whose S covers all elements arriving since the bucket's timestamp. There are at most ⌈ 1 λ ⌉ + 1 buckets at each level, where λ ∈ (0, 1). When a new element arrives, the model creates a 1−bucket and sets its timestamp to the current timestamp. When the number of i − buckets reaches ⌈ 1 λ ⌉ + 2, the two oldest buckets at level i are merged into a 2i − bucket carrying the oldest timestamp iteratively until buckets at all levels are less than ⌈ 1 λ ⌉ + 2. Because each bucket covers elements arriving since its timestamp, merging two buckets is equal to removing the later one. Lin et al. also proved that for any bucket b, its coverage N b satisfies In order to guarantee that quantiles are ǫ-approximate, λ is set to ǫ ǫ+2 . Each bucket preserves a ǫ 2 -approximate S. Quantiles are queried as follows: 1) Scan the bucket list until finding the first bucket b making N b ≤ n. 2) Use LIFT to convert S b to a ǫ-approximate S covering n elements. 3) Search S to find the quantile answer. According to (5) ). Arasu et al. [36] generalized SW model and came up with fixed-and variable-size sliding window approximate quantile algorithms. A window is fixed-size when insertion and deletion of elements must appear in pairs after initialization. The fixed-size sliding window model defines blocks and levels as Figure 5. Each level preserves a partitioning of the data stream into non-overlapping blocks of equal size. Blocks and levels are both numbered sequentially. Assuming the window size is N , the block b in level l contains a Summary • A block is active if all elements covered by it belong to the current window. • A bucket is expired if it covers at least one element which is older than the other N elements. • A bucket is under construction if some of its elements belong to the current window while others are yet to arrive. The highest level with active blocks or blocks under construction L is log 2 ( 4 ǫ ). Blocks in levels above are marked expired.

N1+N2+···+Ns
. Thus, the fixed-size sliding window model can computes ǫ-approximate quantiles over the last N elements using O( 1 ǫ log 1 ǫ log N ) space. Contrary to a fixed-size window, a variable-size window bears no limitation on insertion and deletion, so the window size keeps changing. Arasu et al. used V (n, ǫ) to denote a epsilon-approximate Summary covering n elements, where n is the current size of the window. When a new element arrive, it becomes V (n + 1, ǫ), and when the oldest element leaves, it gets V (n − 1, ǫ). Besides, F n (N, ǫ) is defined as a restriction of F (N, ǫ) to the last n elements as Figure 6. F n (N, ǫ) is the same as F (N, ǫ) except that only blocks whose elements all belong to the most n recent elements, instead of N elements, are assumed active. V (n, ǫ) can be constructed by a set of F n (N, ǫ) in the form of is maintained under various operations as follows: 2 ) and insert the new element into it, thus getting 2 ). • QUERY In order to query ǫ-approximate quantiles over the most n ′ ≤ n recent elements, find the integer l satisfying 2 l−1 < n ≤ 2 l . Then query F n (2 l , ǫ 2 ) and return the answer.
The space complexity of the variable-size sliding window For uncertain data, one may still expect to obtain some consistent answers for queries [55]. Liang et al. [37] extended the problem and resolved approximate quantile queries over uncertain data streams. More specifically, it focuses on maintaining quantile Summaries [34] over data streams whose elements are drawn from individually domain space, represented by continuous or discrete pdf. An uncertain data stream S u is a sequence of elements as {e u 1 , e u 2 , ...}, each of which is drawn from a domain D i with pdf f i : D i → (0, 1] such that p∈Di f i (p) = 1. So, S u is a concise representation of exponential or infinite number of possible worlds W.
. Therefore, the definition of approximate quantiles is generalized as the element W is the rank of p in W and p min is the element minimizing W ∈W P r(W )q(r, r p W ). Liang et al. proposed 2 error metric functions as q(r, r p W ), the squared error function q(r, r p W ) = (r − r p W ) 2 and the related error function q(r, r p W ) = r − r p W . Following GK01, an online algorithm, namely UN-GK, is introduced. UN-GK adjusts tuples in Summaries as v i = p i ∈ D j of e u j ∈ S u , g i = P C min (p i ) − P C min (p i−1 ), and ∆ i = P C max (p i ) − P C min (p i ), where P C min (p) and P C max (p) is the lower bound and the upper bound of probabilistic cardinality [56] of elements in S u no larger than p. And (2) is modified as g i + ∆ i ≤ 2ǫP C(S u ), where P C(S u ) is the probabilistic cardinality of all elements in S u as P C(S u ) = |S u |. In this way, a ǫ-approximate quantile can be queried anytime over an uncertain data stream, and its space complexity is O( 1 ǫ log(ǫP C(S u ))).

B. Distributed Model
In distributed models such as sensor networks, communication between nodes consumes much energy and cuts down the battery life of power-constrained devices. In addition, data transmission takes most of the running time of algorithms. Therefore, the priority of approximate quantile algorithms over distributed models is to reduce the communication cost by decreasing the size of transmitted data.
In 2004, Shrivastava et al. [20] designed an approximate quantile algorithm on distributed sensor networks with fixeduniverse data, named as q-digest. Fixed-universe data refers to elements from a definite collection. Q-digest uses a unique binary tree structure to compress and store elements so that the storage space is cut down. The size of the fixed-universe collection is denoted as σ and the compress coefficient is denoted as k. Figure 7 is the structure of a q-digest tree, which exists in each network node. Each node of the tree covers elements ranging from v.min to v.max, recording the sum of their frequencies as count(v). Each leaf represents one element in the universe, in other words, v.min = v.max. Take node d as an example, its coverage is [7,8] and there are 2 elements in this range. Besides, each node v of the tree must satisfy: where v p is v's parent node and v s is its sibling node. (6) guarantees the upper bound of v' coverage to narrow approximation error while (7) demonstrates when to merge two small nodes so that the storage cost can be reduced. Q-digest proposed an algorithm COMPRESS to merge nodes of a tree: 1) Scan nodes from bottom to top, from left to right, and sum up count(v) and count(v s ) when (7) is violated. 2) Store the sum in v p .
3) Remove the count in v and v s . In order to reduce data communication, we number nodes in the tree from top to bottom, from left to right and only transmits nonempty nodes in the form of (id(v), count(v)), in which way every transmission contains only O(log σ + log N ) data. For example, node c in Figure 7 is transformed to (6, 2).
When a network node receives q-digest trees from other nodes, it merges them with its own tree by adding up count of nodes representing the same elements and applying COMPRESS. After merging all q-digest trees in the network, we can query quantiles by traversing the ultimate tree in postorder, summing up count of passed nodes until the sum exceeds ⌊φN ⌋. The queried quantile is v.max of the current node. The total communication cost of q-digest is O( 1 ǫ |v| log σ), where |v| denotes the number of network nodes.
In the same year, Greenwald et al. [38] proposed an algorithm on distributed models, referred to as GK04. Unlike q-digest, GK04 does not require that all data be fixed-universe. It maintains a collection of ǫ 2 -approximate Summaries [34], denoted as S v , in each node v in the sensor network. More specifically, In the network, data are transmitted in the form of S v . When data arrives at a node, it combines its own S v with the coming S v by iteratively merging S v of the same class from bottom to top. Once the iteration ends, a pruning algorithm is applied to S i v to reduce the number of its tuples to log nv ǫ +1 at most. The pruning would bring up the approximation error of S i v from ǫ ′ to ǫ ′ + ǫ 2 log nv at most, so a ǫ-approximate S is computed after merging S v from all nodes. GK04 only transmits O( 1 ǫ log 2 N ) data, where N is the number of all data in the network. Moreover, if the height of the network topology is far smaller than N , Greenwald et al. improved GK04 to reduce the total communication cost to O( |v| ǫ log N log( h ǫ )). Its basic idea is to apply a new operation, REDUCE, which makes sure that all nodes transmits less than O(log h ǫ ) Summaries, after merging and pruning.
Q-digest and GK04 are both offline algorithms that compute quantiles over stationary data in distributed models. Besides, there are also algorithms focusing on distributed models whose nodes receiving continuous data streams. In 2005, Cormode et al. [39] proposed a quantile algorithm for the scenario that multiple mutually isolated sensors are connected with one coordinator, which traces updating quantiles in real time. Its goal is to ensure ǫ-approximate quantiles at the coordinator while minimizing communication cost between nodes and the coordinator. In general, each remote node maintains a local approximate Summary [34] and informs the coordinator after certain number of updates. In the coordinator, the approximation error ǫ is divided into 2 parts as ǫ = α + β: • α is the approximation error of local Summaries sent to the coordinator. • β is the upper bound on the deviation of local Summaries since the last communication.
Intuitively, larger β allows for large deviation, thus less communication between nodes and the coordinator. But because ǫ is fixed, α is smaller, increasing the size of Summaries sent to the coordinator each time. In other words, α and β are on the trade-off relationship. To resolve the trade-off, Cormode et al. introduces a prediction model in each remote node that captures the anticipated behavior of its local data stream. With the model, the coordinator is able to predict the current state of a local data stream while computing the global Summary and the remote node can check for the deviation between its Summary and the coordinator's prediction. The algorithm proposes 3 concise prediction models: • Zero-Information Model assumes that there is no local update at any remote node since the last communication. • Synchronous-Updates Model assumes that at each time step, each local node receives one update to its distribution. • Update-Rates Model assumes that updates are observed at each local node at a uniform rate with a notion of global time. Using prediction models to reduce communication times, the total communication cost of this algorithm is O( |v| ǫ 2 log N ). In 2013, Yi et al. [40] proposed an algorithm in the same scenario and optimized the cost to O( |v| ǫ log N ). The algorithm divides the whole tracking period into O(log N ) rounds. A new round begins whenever N doubles. It first resolves the median-tracking problem, which can be easily generalized to the quantile-tracking problem. Assume that M is the cardinality of data, fixed at the beginning of a round, and m is the tracking median at the coordinator. The coordinator maintains 3 data structures. The first one is a dynamic set of disjoint intervals, each of which contains between ǫM III. RANDOMIZED ALGORITHMS Generally, randomized approximate quantile algorithms are combined with random sampling. They first sample a part of data and compute overall approximate quantiles with this portion. With less data taken into computation, the computation cost is cut down. In fact, many deterministic quantile algorithms propose randomized versions in this way [36], [40], [41].

A. Streaming Model
Back to 1971, Vapnik et al. [57] proposed a randomized quantile algorithm with space complexity of O( 1 ǫ 2 log 1 ǫ ). This benchmark were raised to O( 1 ǫ log 2 1 ǫ ) by Manku et al. [41] in 1999, referred to as MRL99. MRL99 does not require prior knowledge of the data size N and occupies less space in experiments when φ is an extreme value. The algorithm are improved based on MRL98 [33] so they have the identical frame with only minor differences in the operation NEW: 1) For each buffer, randomly select an element among r consecutive elements. 2) Repeats this operation k times to get k initial elements. 3) Set the buffer's weight to r.
Notice that MRL99 equals with MRL98 if r = 1. Because of the collapsing strategy as Figure 2, the larger weight a buffer has, the greater chance there its data must be retained while collapsing. If r is kept consistent, the probability of newly arrived data being selected will go down continuously. So r should keep changed dynamically, meaning the sampling is nonuniform. MRL99 initially sets r to 2 and traces a parameter h, representing the maximum height of all buffers. When a buffer's height reaches h + i for the first time, where i ≥ 0, r is doubled. By tuning h, b and k, MRL99 manages to compute approximate quantiles with a probability of at least 1 − δ.
Recalling two sliding window models in Section II-A, Arasu et al. proposed the fact that the quantile of a random sample of size O( 1 ǫ 2 log δ −1 ) is an ǫ-approximate quantile of N elements with the probability at least 1 − δ. To sample elements of specific size, a fast alternative is to randomly select one out of 2 k successive elements, where k = ⌊log 2 N/( 1 ǫ 2 log δ −1 )⌋. In this way, k grows logarithmically along with the data stream as required, so the approximation error and the randomization degree are guaranteed.
Another algorithm was proposed by Agarwal et al. [42], which is based on Summaries [34]. Its basic idea is to sample tuples from multiple Summaries and merge them to compute approximate quantiles with low space complexity. There are two situations while merging: same-weight merges and uneven-weight merges. Same-weight merges are for merging two Ss covering the same number of elements. It contains following steps: 1) Combine the two Ss in a sorted way. 2) Label the tuples in order and classify them by label parity. 3) Equiprobably select one class of tuples as the merged result S merged .
Assuming each S covers k elements, if we have k = O( 1 ǫ log( 1 ǫδ )), the algorithm will answer quantile queries with a probability of at least 1 − δ. Uneven-weight merges are for merging two Ss of different sizes, which can be reduced to same-weight merges by a so-called logarithmic technique [38]. The space complexity of the algorithm is O( 1 ǫ log 1.5 1 ǫ ).
However, Agarwal et al. just proposed and analyzed the algorithm in theory without implementation. Afterwards, Felber et al. [43] came up with a randomized algorithm whose space complexity is O( 1 ǫ log 1 ǫ ) but also did not realize it. Besides, this algorithm is not actually useful but only suitable for theoretical study because its hidden coefficient of O is too large.
In 2016, Karnin et al. [44] achieved the space complexity of O( 1 ǫ log log 1 ǫδ ). Its idea is based on MRL99 with some improvements. The first improvement is using increasing compactor capacities as the height gets larger (a compactor operates a COLLAPSE process in MRL99). The second comes from special handling of the top log log 1 δ compactors. And the last is replacing those top compactors with Summaries [34].

B. Distributed Model
As for distributed models, Huang et al. [45] proposed a randomized quantile algorithm which brings down total communication cost from O(|v| log 2 N ǫ ) in GK04 [38] to O( 1 ǫ |v|h), where h denotes the height of the network topology. It contains two version: the flat model and the tree model. For the flat model, all other nodes are assumed to be directly connected to the root node. The algorithm is designed as following steps: 1) Sample elements in node v with a probability of p and compute their ranks in v, denoted as r(a, v) where a is a sampled element. 2) Transmit sampled elements, as well as ones from its child nodes, to its parent node v p . 3) Find predecessors of a, denoted as pred(a, v s ), in v's sibling nodes v s . 4) Estimate the rank of a in v s , denoted asr(a, v s ), according to r(pred(a, v s ), v s ). 5) Compute the approximate rank of a in v p as r(a, v p ) = r(a, v s ) + r(a, v). If elements are not uniformly distributed in the network and some nodes contain the majority, the total communication cost will increase dramatically as the effect of load imbalance. The algorithm resolves the problem by tuning p based on data amount in each node: • If the amount is greater than N/ |v|, p is set to 1/(ǫN v ). • Otherwise, p is set to Θ( |v|/(ǫN )). The inconsistent probability of sampling makes sure that O( 1 ǫ ) elements are sampled at most in each node no matter how many elements there exist at first. After transmitting all sampled elements to the root node, the element a whose rank r(a, v root ) is closest to ⌊φN ⌋ is returned as the queried quantile. For the tree model, things become more complicated for two reasons. First, an intermediate node may suffer from heavy traffic going through if it has too many descendants without any data reduction. Second, each message needs O(h) hops to reach the root node, leading to the total communication of O(h |v|/ǫ). To resolve the first problem, Huang et al. proposed a algorithm MERGE in a systematic way to reduce data size. As for the second problem, the basic idea is to partition the routing tree into t connected components, each of which has O(|v|/t) nodes. Then each component is shrunk into a "super node". Now the height of the tree reduces to t. By setting t = |v|/h, the desired space bound becomes O( 1 ǫ |v|h). In 2018, Haeupler et al. [46], gave a drastically faster gossip algorithm, referred as Haeupler18, to compute approximate quantiles. Gossip algorithms [58] are algorithms that allow nodes in a distributed network to contact with each other randomly in each round and gradually converge to get final results. The algorithm contains two phases. In the first phase, each node adjusts its value so that the quantiles around φquantile become the median quantiles approximately. And in the second phase, nodes compute their approximate median quantiles to get the global result. Haeupler et al. proved that the algorithm requires O(log log N + log 1 ǫ ) rounds to solve the ǫ-approximate φ-quantile problem with high probability.
IV. IMPROVEMENT So far, in the discussion about approximate quantile algorithms, they are generally used with constant approximation error and indiscriminate performance on data regardless of data distribution. However, in some cases, we may have known that the data is skewed [59]- [63]. In other cases, quantile queries over high-speed data streams need to be updated and answered highly efficiently [64], [65]. In addition, there are also techniques for optimizing quantile computation with the help of GPUs [66], [67]. This section presents several techniques for improving the performance and efficiency of approximate quantile algorithms in various scenarios.

A. Skewness
The first algorithm is known as t-digest, proposed by Dunning et al. [63]. T-digest is for computing extreme quantiles such as the 99th, 99.9th and 99.99th percentiles. Totally different from q-digest, its basic idea is to cluster real-valued samples like histograms. But they differ in three aspects. First, the range covered by clusters may overlap. Second, instead of lower and upper bounds, a cluster is represented by a centroid value and an accumulated size on behalf of the number of elements. Third, clusters whose range is close to extreme values contain only a few elements so that the approximation error is not absolutely bounded, but relatively bounded, which is φ (1 − φ). T-digest can be applied to both streaming models and distributed models because the proposed cluster is a mergeable structure. The merge is restricted by the size bound of clusters. Dunning et al. proposed several scale functions to define the bound. The standard is that the size of each cluster should be small enough to get accurate quantiles, but large enough to avoid winding up too many clusters. A scale function is where δ is the compression parameters and the size bound is defined as where W lef t and W are respectively the weight of clusters whose centroid values are smaller than that of the current cluster and of current cluster. As Figure 8 shows, (8) is nondecreasing and is steeper when φ is closer to 0 or 1, which means clusters covering extreme values have smaller size, making the algorithm more accurate for computing extreme quantiles and more robust for skewed data. The second algorithm, proposed by Lin et al. [61] and elaborated by Liu et al. [62], is aimed at streaming models, using nonlinear interpolation. The algorithm maintains two buffers, the quantile buffer Q = {q 1 , q 2 , ..., q m }, where q i is the approximate φ i -quantile, and the data buffer B of size n, holding the most recent n elements. Q is estimated from observed data and incrementally updated when B is fullfilled. In order to estimate the extreme quantiles accurately, the nonlinear interpolation F (x), which is an approximate distribution function estimated from a training set stream, is leveraged together with B and Q to update Q.
The third algorithm was proposed by Cormode et al. [59] for skewed data. Again, the algorithm is an improved version of GK01 for two problems. The first problem is the biased quantiles problem. Low-biased quantiles are the set of elements whose rank ⌊φ i N ⌋ for i = 1, 2, ..., log 1/φ N , and high-biased quantiles are symmetry by reversing the ordering relation. The definition is easy to be generalized to approximate quantiles. The problem is computing the first k elements in high-biased approximate quantiles. Similar with GK01 that g i and ∆ i are restricted by g i + ∆ i ≤ ⌊2ǫN ⌋, the algorithm generalizes the restriction as For the biased quantiles problem, f (r i , N ) is set to 2ǫr i . The restriction is tighter than that in GK01 so the correctness is guaranteed. Cormode et al. proved that the space lower bound is Ω( 1 ǫ min(k log 1 φ , log(ǫN ))). The other problem is targeted quantiles problem that quantiles meeting a set of pairs T = {(φ i , ǫ i )} are required to be maintained. In such case, Also, the state-of-the-art space upper bound for biased quantile computation is O( 1 ǫ log 3 ǫN ) with a deterministic comparison-based merging-and-pruning strategy [68]. As for a randomized version, Zhang et al. [69] achieved where α > 0 and worst case of O( 1 ǫ 2 log 1 ǫ log 2 ǫN ). For problems resolved by q-digest [20] that all elements are selected from a fixed-universe collection, Cormode et al. [60] combined the binary tree structure in q-digest and standard dictionary data structures [70], proposing a new deterministic algorithm to compute biased quantiles with space complexity of O( 1 ǫ log σ log(ǫN )), where σ denotes the size of the fixeduniverse collection. And a simpler sampling-based approach by Gupta et al. [71] uses space of O(ǫ −3 log 2 N log σ).

B. High-Speed Data Streams
In order to compute quantiles over high-speed data streams, both computational cost and per-element update cost need to be low. Zhang et al. [65] proposed an algorithm for both fixed-and arbitrary-size high-speed data streams. Let N and n denote the number of elements in the entire data stream and elements seen so far. For the fixed-size data streams, where N is given, a multi-level summary structure S = {s 0 , s 1 , ..., s L } is maintained, where s i is the summary at level i, as shown in Figure 9. Each element in s is stored with its upper and lower bound of rank, r min (e) and r max (e). The data stream is divided into blocks of size b = ⌊ 1 ǫ log ǫN ⌋ and s i covers a disjoint bag B i . Among bags, B 0 contains the most recent blocks even though it may be incomplete. The structure is maintained as follows: 1) Insert the new element to s 0 .
2) If |s 0 | < b, the procedure is done. Otherwise, compress s 0 to generate a sketch s c of size ⌊ b 2 ⌋ and send it to level 1. COMPRESS would raise the approximation error from ǫ 0 to ǫ 0 + 1 b . 3) If s 1 is empty, set s 1 to s c and the procedure is done.
Otherwise, merge s 1 with s c and empty s 0 . Finally, compress the merged s and send it to level 2. 4) Repeat the steps above until an empty s i is found. To answer quantiles, the algorithm first sorts s 0 and merges s at all levels. Then it searches the merged s and return the element satisfying that r min (e) ≥ r − ⌊ǫN ⌋ and r max (e) ≤ r + ⌊ǫN ⌋. As for the arbitrary-size data streams, the basic idea is to partition the data stream into disjoint sub-streams d i with the size 2 i ǫ in their arrival order, and use the algorithm for the fixed-size data streams on each sub-stream because their length is known now. The computational cost and the per-element update cost of both algorithms are O(N log( 1 ǫ log ǫN )) and O(log log N ). Compared to GK01, the experimental results over high-speed data streams are reported to achieve about 200 ∼ 300x speedup.
Besides, in order to lighten the burden of massive continuous quantile queries with different φ and ǫ, Lin et al. [64] proposed 2 techniques for processing queries. The first technique is to cluster multiple queries as a single query virtually while guaranteeing accuracy. Its basic idea is to cluster the queries that share some common results. The second technique is to minimize both the total number of times for reprocessing and the number of clusters. It adopts a triggerbased lazy update paradigm.

C. GPU
Govindaraju et al. [66] studied optimizing quantile computation using graphics processors, or GPU for short. GPUs are well designed for rendering and allow many rendering applications to raise memory performance [72]. In order to utilize the high computational power of GPUs, Govindaraju et al. proposed an algorithm based on sorting networks. Sorting networks are a set of sorting algorithms mapped well to meshbased architectures [73]. Operations in the algorithm, including comparisons and comparator mapping, are realized by color blending and texture mapping in GPUs. The theoretical algorithm they used is GK04 [38] and by taking advantage of high computational power and memory bandwidth of GPUs, the algorithm offers great performance for quantile computation over streaming models.

V. APPROXIMATE QUANTILE COMPUTATION TOOLS
Over the past decades, many off-the-shelf, open-source packages or tools for quantile computation have been developed and available to users. Some of them are based on exact quantile computation while others implement approximate quantile computation. In this section, we review such tools, focusing on their algorithmic theories and application scenarios.
Most basically, SQL provides a window function ntile 1 which helps with quantile computation. It receives a parameter b and partitions a set of data into b buckets equally. Therefore, with sortBy and ntile, we can partition the data in order and get the leading element of each partition as the quantile. Also, these quantiles are exact.
MATLAB is a famous numerical computing environment and programming language. It provides comprehensive packages for computing various statistics or functions. Quantiles are included as its in-tool function 2 . The function quantile receives a few parameters, including a numerical array, the cumulative probability and the computation kind. It computes quantiles in either exact or approximate way. It computes exact quantiles by the classic algorithm that uses sort. And it implements t-digest [63] for approximate quantiles. So, this function is suitable to compute quantiles over distributed models, which benefit from parallel computation.
The distributed cluster-computing framework, Spark, provides approximate quantile computation since version 2.0 3 . It implements GK01 and can be called in many programming languages such as Python, Scala, R and Java. Computing on a Spark dataframe, the function contains 3 parameters as a dataframe, the name of a numerical column, a list of quantile probabilities and the approximation error. Since version 2.2, the function has been upgraded to support computation over multiple dataframe columns.
In Java, Google publishes an open-source set of core libraries, named as Guava 4 . It includes new collection types, graph libraries, support for concurrency, etc. Also, quantile computation is included in its functions as median() and percentiles(). The quantiles are exact results so the average time complexity of its implementation is O(n) while the worst case time complexity is O(N 2 ). It optimizes multiple quantile computation on the same dataset with indexes, improving the performance in some degree. Another package that supports quantile computation is Apache Beam 5 . It is a unified programming model for batch and streaming data processing on execution engines. Unlike Guava, it implements approximate quantile computation.
The programming language, Rust 6 , implements approximate quantile computation over data streams with a moderate amount of memory. It implements the algorithms, GK01 [34] and CKMS [59], so that no boatload of information is stored in memory. Besides, it implements a variant of quantile computation, an ǫ-approximate frequency count for a stream, outputting the k most frequent elements, with the algorithm Misra-Gries [74], which helps with an estimation of quantiles as well. As for C++, a peer-reviewed set of core libraries, Boost, provides exact quantile computation as its in-tool function 7 .

VI. FUTURE DIRECTIONS AND CONCLUSIONS
In this survey, we have presented a comprehensive survey of approximate quantile computation. Readers who want to know more about the performance of basic quantile algorithms can read a paper by Luo et al. [29], which implements a part of the quantile algorithms above and compares them by experiments. Besides, Cormode et al. [75] summarized lower bounds of the researches about comparison-based quantile computation so far and proved that the space complexity of GK01 is optimal by showing a matching lower bound.
Even though approximate quantiles have been studied for more than three decades, there is still plenty of room for improvement. The explosion of data also brings new challenges to this field. There are many studies to resolve quantile computation in large-scale data, but not enough. Besides, with the development of industries, new scenarios keep appearing so that quantile computation should be optimized specifically as well. And the evolvement of techniques brings new direction and new potential to quantile computation. Here we present a few future directions for quantile computation.
First, almost all quantile algorithms require at least onepass scan over the entire dataset, no matter it is applied on streaming models or distributed models. However, at some time, the cost of scanning the entire data is intolerable if quantile queries are demanded to be answered in time. In such case, only a portion of the entire dataset is permitted into the process of the whole algorithm. This condition is more restricted than that in existing randomized quantile algorithms. Even if randomized algorithms sample a part of data to save the space of computation, the process of sampling requires the participation of the entire dataset, which means that they need to scan all data once. To resolve the problem, we need to determine the sampling methods first. Unlike fine-grained sampling in existing randomized algorithms, we may use coarse-grained sampling, such as sampling in the unit of block. The method should also take the way of storage into account. For example, if the dataset is stored in distributed HDFS [76], we may sample part of blocks, avoiding scanning the entire data. The sampling methods may be exclusive, varying based on applications. After determining sampling methods, we need to consider how much data is sampled to balance the accuracy and the occupation of time and space. One direction is to simulate the idea in machine learning [77] that data are trained (or sampled in our situation) continuously until the quantile result converges.
Second, many new computation engines are being developed. They can be classified into two categories in general. One is streaming computation engines and the other is distributed computation engines. Streaming computation engines, represented by Spark Streaming, Storm and Flink [78], are aimed at real-time streaming needs with minor difference in some ways. For example, Storm and Flink behave like true streaming processing systems with lower latencies while Spark Streaming can handle higher throughput at the cost of higher latencies. Distributed computation engines, represented by Spark and GraphLab [79], are designed to analyze distributed data such as datasets with graph properties.They have their pros and cons in heterogeneous scenarios as well. The characteristics of these engines may be utilized in the implementation of algorithms. As reviewed in Section V, only a small fraction of approximate quantile algorithms is implemented in them such as GK01 in Spark. But many other algorithms are still needed to be implemented and optimized purposefully. For example, Spark [80] is a framework for computation in distributed clusters, supporting parallel computation naturally and having potential of benefitting many quantile algorithms such as MRL98. Its streaming version, Spark Structured Streaming [81], supports stream processing, as well as window operations. It may help quantile algorithms over streaming models, such as Lin SW and Lin n-of-N [35], to be implemented in a more efficient way, even though the theoretical complexity remains the same. Many quantile algorithms are implemented with basic languages while others are even without implementation. Transplanting them to new computation engines is not an easy work and there is great room for optimization.
Third, with the appearance of new specific application scenarios, the requirements of approximate quantile algorithms evolve as well. For example, nowadays, more and more attention is being paid to the correlation of data, and data graph is one way to present the correlation. In a data graph, each edge is usually associated with a weight, representing frequencies of the appearance of the correlation. One may need to determine a threshold for pruning edges by their weights for better performance in analysis [82], [83]. And we can use quantiles to determine the threshold. However, unlike random sampling in a dataset, the edges in a graph are correlated (the frequencies of two edges connecting to the same vertex are correlated). Maybe it is a challenge, as well as an opportunity, to efficiently compute approximate quantiles in a correlated data graph. Furthermore, if the graph is constrained by some patterns [84], [85], the algorithms may be improved and optimized correspondingly. Other cases include computing quantiles in the Blockchain network [86]. Unlike traditional distributed models, which have a master node and multiple slave nodes, the Blockchain network is decentralized, making merging quantile algorithms infeasible. Further research is needed for computing approximate quantiles in such network. Haeupler et al. [46], which uses gossip algorithms, provides a good direction, but there is much more to be done.