No DBA? No Regret! Multi-Armed Bandits for Index Tuning of Analytical and HTAP Workloads With Provable Guarantees

Automating physical database design has remained a long-term interest in database research due to substantial performance gains afforded by optimised structures. Despite significant progress, a majority of today's commercial solutions are highly manual, requiring offline invocation by database administrators (DBAs). This status quo is untenable: identifying representative static workloads is no longer realistic; and physical design tools remain susceptible to the query optimiser's cost misestimates. Furthermore, modern application environments like hybrid transactional and analytical processing (HTAP) systems render analytical modelling next to impossible. We propose a self-driving approach to online index selection that does not depend on the DBA and query optimiser, and instead learns the benefits of viable structures through strategic exploration and direct performance observation. We view the problem as one of sequential decision making under uncertainty, specifically within the bandit learning setting. Multi-armed bandits balance exploration and exploitation to provably guarantee average performance that converges to policies that are optimal with perfect hindsight. Our comprehensive empirical evaluation against a state-of-the-art commercial tuning tool demonstrates up to 75% speed-up in analytical processing environments and 59% speed-up in HTAP environments. Lastly, our bandit framework outperforms a Monte Carlo tree search (MCTS)-based database optimiser, providing up to 24% speed-up.


Introduction
With the growing complexity and variability of database applications and their hosting platforms (e.g., multi-tenant cloud environments), automated physical design tuning, particularly automated index selection, has re-emerged as a contemporary challenge for database management systems. Most database vendors offer automated tools for physical design tuning within their product suites [3,68,21]. Such tools form an integral part of broader efforts toward fully automated database management systems, which aim to: a) decrease database administration costs and thus total costs of ownership [67,56]; b) help non-experts use database systems; and c) facilitate hosting of databases on dynamic environments such as cloud-based services [20,52,22,24]. Most physical design tools take an offline approach, where the representative training workload is provided by the database administrator (DBA) [16]. Where online solutions are provided [61,60,13,12,22,46], questions remain: How often should the tools be invoked? And, importantly, is the quality of proposed designs in any way guaranteed? How can tools generalise beyond seen queries to dynamic ad-hoc workloads, where queries are unpredictable and non-stationary?
Modern analytics workloads are dynamic in nature, with ad-hoc queries common [38]; e.g., data exploration workloads adapt to past query responses [4]. Such ad-hoc workloads hinder automated tuning since: a) inputting representative information to design tools is infeasible under time-evolving workloads; and b) reacting too quickly to changes may result in performance variability, where indices are continuously dropped and created. Any robust automated physical design solution must address such challenges [61].
The situation is further aggravated in HTAP environments, which consist of online transaction processing (OLTP) and online analytical processing (OLAP) workloads. While indices provide (primarily) positive benefits to OLAP queries, they hinder OLTP query performance due to the additional index maintenance overhead. Furthermore, in dynamic settings, the workload composition (i.e., the analytical-to-transactional ratio) can vary over time, making it even more challenging to identify useful indices that boost overall workload performance.
To compare alternative physical design structures, automated design tools use the cost model employed by the query optimiser, typically exposed through a "what-if" interface [14], as the sole source of truth. However, such cost models make inappropriate assumptions about data characteristics [17,44]: commercial DBMSs often assume attribute value independence and uniform data distributions when sufficient statistics are unavailable [2,44,11]. As a result, estimated benefits of proposed designs may diverge significantly from actual workload performance [9,26,11,10,24,22]. Even with more complex data distribution statistics, such as single- and multi-column histograms, the issue remains for complex workloads [11]. Moreover, data additions and updates in HTAP environments continuously invalidate statistics, compounding the optimiser's misestimates. Keeping statistics up to date in such a setting requires extra effort.
In this paper, we demonstrate that even in ad-hoc environments where queries are unpredictable, there are opportunities for index optimisation. We argue that the problem of online index selection under ad-hoc, analytical and HTAP workloads can be efficiently formulated within the multi-armed bandit (MAB) learning setting, a tractable form of Markov decision process. MABs take arms or actions (selecting indices) to maximise cumulative rewards, trading off exploration of untried actions with exploitation of actions that maximise rewards observed so far (see Figure 1). MABs permit learning from observations of actual performance, and need not rely on potentially misspecified cost models. Unlike initial efforts at applying learning to physical design, e.g., more general forms of reinforcement learning [62], bandits offer regret bounds that guarantee the fitness of dynamically-proposed indices [58].
The key contributions of the paper are summarised next:
- We model index tuning as a multi-armed bandit, proposing design choices that lead to a practical, competitive solution.
- Our proposed design achieves a worst-case safety guarantee against any optimal fixed policy, as a consequence of a corrected regret analysis of the C²UCB bandit.
- We introduce a new bandit flavour that extends the existing contextual and combinatorial bandit, where structured rewards are observed for each arm, providing additional feedback for the learned weights. This bandit variation enjoys a better regret bound than the C²UCB bandit.
- Our comprehensive experiments demonstrate MAB's superiority over a state-of-the-art commercial physical design tool and a deep reinforcement learning agent, with up to 75% speed-up over the former and 58% speed-up over the latter, under dynamic, analytical workloads.
- We showcase MAB's ability to perform in complex HTAP environments, which are notoriously challenging for index tuning, delivering 59% speed-up over the state-of-the-art commercial design tool.

Problem Formulation
The goal of the online database index selection problem is to choose a set of indices (referred to as a configuration) that minimises the total running time of a workload sequence within a given memory budget. Neither the workload sequence nor system run times are known in advance.
We adopt the problem definition of [13]. Let the workload W = (w_1, w_2, ..., w_T) be a sequence of mini-workloads (e.g., a sequence of single queries), I the set of secondary indices, C_mem(s) the memory space required to materialise a configuration s ⊆ I, and S = {s ⊆ I | C_mem(s) ≤ M} ⊆ 2^I the class of index configurations feasible within our total memory allowance M. Our goal is to propose a configuration sequence S = (s_0, s_1, ..., s_T), with s_t ∈ S as the configuration in round t and s_0 = ∅ as the starting configuration, which minimises the total workload time C_tot(W, S) defined as:

C_tot(W, S) = \sum_{t=1}^{T} [ C_rec(t) + C_cre(s_{t-1}, s_t) + C_exc(w_t, s_t) ].

Here C_rec(t) refers to the recommendation time in round t (defined as the running time of the recommendation tool), and C_cre(s_{t-1}, s_t) refers to the incremental index creation time in transitioning from configuration s_{t-1} to s_t. Finally, C_exc(w_t, s_t) denotes the execution time of mini-workload w_t under the configuration s_t, namely the sum of response times of individual queries.
At round t, the system:
1. Chooses a set of indices s_t ∈ S in preparation for the upcoming workload w_t, without direct access to w_t. s_t depends only on observation of historical workloads (w_1, ..., w_{t-1}), the corresponding sets of chosen indices, and resulting performance;
2. Materialises the indices in s_t which do not exist yet, that is, all indices in the set difference s_t \ s_{t-1}; and
3. Receives workload w_t, executes all the queries therein, and measures the elapsed time of each individual query and each operator in the corresponding query plan.
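The three steps above can be sketched as a minimal driver loop. Here `recommend`, `materialise`, and `execute` are hypothetical stand-ins for the recommendation tool, the DBMS index-creation step, and query execution; they are not interfaces defined in the paper.

```python
def online_tuning_round(history, current_config, recommend, materialise, execute):
    """One round of the online protocol: choose, materialise, observe."""
    # 1. Choose s_t using only past workloads and observed performance.
    new_config = recommend(history)
    # 2. Materialise only the indices not already present: s_t \ s_{t-1}.
    creation_time = materialise(new_config - current_config)
    # 3. Execute the arriving mini-workload and record per-query times.
    workload, exec_times = execute(new_config)
    history.append((workload, new_config, exec_times))
    return new_config, creation_time, sum(exec_times)
```

Summing the per-round creation and execution times (plus recommendation time) over rounds recovers the objective C_tot defined above.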

Contextual Combinatorial Bandits
In this paper, we argue that online index selection can be successfully addressed using multi-armed bandits (MABs) from statistical machine learning, where different arms correspond to chosen indices. We first present necessary background on MABs, outlining the essential properties that we exploit in our work (i.e., bandit context and combinatorial arms) to converge to highly performant index configurations. We use the following notation. We denote non-scalar values with boldface: lowercase for (by default column) vectors and uppercase for matrices. We also write [k] = {1, 2, ..., k} for k ∈ N, and denote the transpose of a matrix or a vector with a prime.
The contextual combinatorial bandit setting under semi-bandit feedback involves repeated selections from k possible actions, over rounds t = 1, 2, ..., in which the MAB:
1. Observes a context feature vector (possibly random or adversarially chosen) of each action or arm i ∈ [k], denoted X_t = {x_t(i)}_{i∈[k]} for x_t(i) ∈ R^d, along with their costs c_i;
2. Selects or pulls a set of arms (referred to as a super arm) s_t ∈ S_t, where we restrict the class of possible super arms S_t ⊆ {s ⊆ [k] | Σ_{i∈s} c_i ≤ M} ⊆ 2^{[k]}; and
3. For each i_t ∈ s_t, observes a random score r_t(i_t) drawn from a fixed but unknown arm distribution which depends solely on the arm i_t and its context x_t(i_t), whose true expected values are contained in the unknown vector r*_t = {E[r_t(i)]}_{i∈[k]}.
Remark 1 The contextual combinatorial bandit setting is a special case of a Markov decision process, which is solved in general by reinforcement learning (RL). The key difference is that in bandits, state transitions are not affected by MAB actions; only rewards are. States (observed via contexts) arrive arbitrarily. This simplicity admits theoretical guarantees for practical MAB learners, where state-of-the-art RL agents regularly have none. When playing in a bandit setting, in practice MAB learners may converge faster than their (typically over-parametrised) RL cousins.
A MAB's goal is to maximise the cumulative expected reward Σ_t E[R_t(s_t)] = Σ_t g(s_t, r*_t, X_t) for a known function g. This function g need not be a simple summation of all the scores, but is typically assumed to be monotonic and Lipschitz smooth in the arm scores.
The core challenge in this problem is that the expected scores for all arms i ∈ [k] are unknown. Refinement of a bandit learner's approximation for arm i is generally only possible by including arm i in the super arm, as the score for arm i is not observable when i is not played. This suggests solutions that balance exploration and exploitation. Even though at first glance it may seem that each arm needs to be explored at least once, placing practical limits on large numbers of arms, there is a remedy, as will be discussed shortly.
The C²UCB algorithm. Used to solve the contextual combinatorial bandit problem, the C²UCB algorithm [58] models the arms' scores as linearly dependent on their contexts: r_t(i) = θ′x_t(i) + ε_t(i) for an unknown zero-mean (subgaussian) random variable ε_t(i), unknown but fixed parameter θ ∈ R^d, and known context x_t(i). It is crucial to notice the implication that all learned knowledge is contained in estimates of θ, which is shared between all arms, obviating the need to explore each arm. Estimation of θ can be achieved using ridge regression, with |s_t| new data points {(x_t(i), r_t(i))}_{i∈s_t} available at round t, further accelerating the convergence rate of the estimator θ̂ over observing only one example per round, as might be naïvely assumed.
Point estimates of the expected scores can be made with r̂_t(i) = θ̂′_t x_t(i), where θ̂_t are the trained coefficients of a ridge regression of observed rewards against contexts. However, this quantity is oblivious to the variance of the score estimate. Intuitively, to balance exploration and exploitation, it is desirable to add an exploration boost to the arms whose scores we are less sure of (i.e., greater estimate variance). This suggests that the upper confidence bound (UCB) should be used in place of the expected value, which can be calculated [45] as:

r̄_t(i) = θ̂′_{t-1} x_t(i) + α_t √( x_t(i)′ V^{-1}_{t-1} x_t(i) ),     (1)

where α_t > 0 is the exploration boost factor, and V_{t-1} is the positive-definite d × d scatter matrix of contexts for the chosen arms up to and including round t − 1.
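A minimal numpy sketch of the score and update computations under the assumed linear model. Function and variable names are ours, not the paper's; `V` is initialised to a regularised identity (the ridge prior) and `b` accumulates reward-weighted contexts.

```python
import numpy as np

def ucb_scores(contexts, V, b, alpha):
    """Upper-confidence scores: theta_hat'x + alpha * sqrt(x' V^-1 x).

    contexts: (k, d) array of per-arm context vectors x_t(i).
    V: (d, d) regularised scatter matrix; b: (d,) response vector.
    """
    theta_hat = np.linalg.solve(V, b)           # ridge estimate theta_hat = V^-1 b
    V_inv = np.linalg.inv(V)
    means = contexts @ theta_hat                # exploitation term
    # Exploration boost: larger along under-explored context directions.
    widths = np.sqrt(np.einsum("id,de,ie->i", contexts, V_inv, contexts))
    return means + alpha * widths

def update(V, b, pulled_contexts, rewards):
    """Rank-one updates for each observed (x, r) pair in round t."""
    for x, r in zip(pulled_contexts, rewards):
        V = V + np.outer(x, x)
        b = b + r * x
    return V, b
```

With `V = I` and `b = 0` (no observations yet), all point estimates are zero and the score reduces to the exploration term alone, so initial play is pure exploration.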
The first term of r̄_t(i) corresponds to arm i's immediate reward estimate, whereas the second term corresponds to its exploration boost: its value is larger when the arm is sensitive to context elements we are less confident of (i.e., under-explored context dimensions). Hence, by using r̄_t(i) in place of r̂_t(i), arms with contexts lying in under-explored regions of the context space are more likely to be chosen, as higher scores yield higher g, assuming that g is monotonically increasing in the arm rewards.
Ideally, the super arm s_t ∈ S_t is chosen such that g(s_t, r̄_t, X_t) is maximised. However, it is sometimes computationally expensive to find such super arms. In such cases, it is often good enough to obtain, via some approximation algorithm, a solution for which g(s_t, r̄_t, X_t) is near maximum. With this criterion in mind, we now define an α-approximation oracle.
Definition 3 An α-approximation oracle is an algorithm A that outputs a super arm s = A(r, X) with the guarantee g(s, r, X) ≥ α · max_{s′} g(s′, r, X), for some α ∈ [0, 1] and given inputs r and X.
The performance of a bandit algorithm is usually measured by its cumulative regret, defined as the total expected difference between the reward of the chosen super arm E[R_t(s_t)] and that of an optimal super arm max_{s∈S_t} E[R_t(s)] over T rounds. Such a metric is unfair to C²UCB, since its performance depends on the oracle's performance. This suggests assessing C²UCB's performance with a metric that uses the oracle's performance guarantee as its measuring stick, as follows.

Definition 4 Let s_t be the super arm returned by an α-approximation oracle as a part of the bandit algorithm, and r*_t be a vector containing each arm's true expected scores. Then the cumulative α-regret is the sum of expected instantaneous regrets,

Reg^α_T = Σ_{t=1}^{T} ( α · max_{s∈S_t} E[R_t(s)] − E[R_t(s_t)] ).

When g is assumed to be monotonic and Lipschitz continuous, [58] claimed that C²UCB enjoys Õ(√T) α-regret. We have corrected an error in the original proof, as seen in Appendix A, confirming the Õ(√T) α-regret. This expression is sub-linear in T, implying that the per-round average regret approaches zero after sufficiently many rounds. Consequently, online index selection based on C²UCB comes endowed with a safety guarantee on worst-case performance: selections become at least as good as an α-optimal policy (with perfect access to true scores), and potentially much better than any fixed policy.

MAB for Online Index Selection
Performant bandit learning for online index tuning demands arms covering important actions and no more; rewards that are observable and for which regret bounds are meaningful; and contexts and an oracle that are efficiently computable and predictive of rewards. Each workload query is monitored for characteristics such as running time, query predicates, payload, etc. (see Figure 1). These observations feed into the generation of relevant arms and their contexts. The learner selects a desired configuration, which is materialised. After queries return, the system identifies the benefits of the materialised indices, which are then shaped into the reward signal for learning.
Dynamic arms from workload predicates. Instead of enumerating all column combinations, relevant arms (indices) may be generated based on queries: combinations and permutations of query predicates (including join predicates), with and without inclusion of payload attributes from the selection clause. Such workload-based arm generation drastically reduces the action space, and exploits the natural skewness of real-life workloads, which focus on small subsets of attributes rather than full tables [4]. Workload-based arm generation is only viable due to dynamic arm addition (reflecting a dynamic action space), which is allowed by the bandit setting: we may define the set of feasible arms for each round at its start.
Context engineering. Effective contexts are predictive of rewards, efficiently computable, and promote generalisation to previously unseen workloads and arms. We form our context in two parts (see Figure 1).
Context Part 1: Indexed column prefix. We encode one context component per column. However, unlike a bag-of-words or one-hot representation appropriate for text, similarity of arms depends on having similar column prefixes; sharing common index columns is insufficient. This reflects a novel bandit-learning aspect of the problem. A context component has value 10^{-j}, where j is the corresponding column's position in the index, provided that the column is included in the index and is a workload predicate column. The value is set to 0 otherwise, including if the column's presence only covers the payload. Unlike a simple one-hot encoding, this context enables the bandit to differentiate between arms with the same set of columns but different ordering, and to reward columns differently based on their position in the index.
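The prefix encoding can be illustrated with a short helper. Names are ours, and we assume positions are counted from zero, so the leading index column contributes 1, the second 10⁻¹, and so on; the paper's exact indexing convention may differ by one.

```python
def prefix_context(index_columns, predicate_columns, all_columns):
    """Context part 1: value 10^-j for a predicate column at position j
    in the index; 0 for columns absent from the index or included only
    to cover the payload."""
    ctx = {c: 0.0 for c in all_columns}
    for j, col in enumerate(index_columns):
        if col in predicate_columns:
            ctx[col] = 10.0 ** (-j)
    return [ctx[c] for c in all_columns]
```

An index on (C2, C3, C1) where only C2 and C3 are predicate columns yields distinct values for C2 and C3 reflecting their order, and 0 for the payload-only column C1, matching the treatment of IX5 in Example 1.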
Example 1 Under the simplest workload (a single query) in Figure 1, our system generates six arms: four using different combinations and permutations of the predicates, and two further including the payload (covering indices). Index IX5 includes column C1, but the context for C1 is valued at 0, as this column is included only due to the query payload.
Context Part 2: Derived statistical information. We represent statistical and derived information about the arms and workload: details available during query execution, and sufficient statistics for unbiased estimates. This statistical information includes: a Boolean indicating a covering index; the estimated size of the index divided by the database size (if not materialised already, 0 otherwise); and the number of times the optimiser has picked this arm in recent rounds. These are shown in Figure 1 under D1, D2 and D3, respectively.
Reward shaping. As the goal of physical design tuning tools is to minimise end-to-end workload time, we incorporate index creation time and query execution time into the reward for a workload. We omit index recommendation time, as it is independent of arm selection; however, we measure and report the recommendation time of the MAB algorithm in our experiments. Recall that MAB depends only on observed execution statistics from implemented configurations, and on generalisation of the learned knowledge to unseen arms thereafter.
The reward for an arm includes the execution time as a gain G_t(i, w_t, s_t) attributed to each arm i for workload w_t under configuration s_t. Indices can impact the execution time in multiple ways. We split the execution time gain into three components: a) data scan gains (G^ds_t); b) index maintenance gains (G^im_t, usually a negative value); and c) gains in other areas of the query plan which are difficult to attribute to a single index (unclaimed gains, G^un_t).
Data scan gains: Defining U(s, q) as the set of indices used by the query optimiser for data access in query q under a given configuration s, the data scan gain of an index i ∈ U(s, q) for query q is measured against the full-scan baseline:

G^ds_t(i, {q}, s) = C_tab(τ(i), q, ∅) − C_tab(τ(i), q, s),

where C_tab(τ(i), q, ∅) represents the full table scan time for table τ(i) and query q, C_tab(τ(i), q, s) the corresponding data access time under configuration s, and τ(i) is the table to which i belongs.

Index maintenance gains: Index maintenance operations can take different forms based on the number of rows updated. The optimiser typically opts for row-wise updates for a small number of rows, and index-wise updates otherwise. In the second case, we can easily capture the maintenance gain of an index, as each index is updated separately. This is, however, not straightforward in the case of row-wise updates, where all indices are bulk-updated for each row. On these occasions, we compute the total maintenance gain G^im_t(V(s, q), {q}, s) for all secondary indices that require maintenance due to a query q under a given configuration s, and divide it equally among the updated indices. Here V(s, q) represents the set of indices updated under configuration s by the query q, and

G^im_t(V(s, q), {q}, s) = C^im(q, ∅) − C^im(q, s),

where C^im(q, ∅) and C^im(q, s) represent the index maintenance time without secondary indices and the index maintenance time under configuration s, respectively.
Unclaimed gains: The use of indices can impact the query plan in subtle ways which cannot easily be attributed to a single index. For example, introducing a new index can trigger the optimiser to choose a different query plan. Even when the index provides a faster data scan, the new query execution can end up taking more time due to an inefficient nested loop join. Even though this issue arises from the optimiser, MAB needs to synchronise with the optimiser and possibly take corrective actions to trigger a different query plan with better execution time. These gains can be computed by comparing query running times before and after index creation. We compute the total query gain (G^to_t) as:

G^to_t({q}, s_t) = C^to(q, ∅) − C^to(q, s_t),

where C^to(q, s_t) represents the total running time of query q under configuration s_t. Once the data scan gain and index maintenance gain are calculated, we calculate the total unclaimed gain for a query by subtracting the total data scan and index maintenance gains from the total query gain. Then we divide this gain equally among the participating indices (U(s, q) ∪ V(s, q)).
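The unclaimed-gain split can be sketched as follows. Names are illustrative; per-index data scan and maintenance gains are passed as dicts keyed by index.

```python
def unclaimed_gain_shares(total_gain, ds_gains, im_gains, participating):
    """Split the gain not attributable to any single index equally
    among the indices that participated in the query."""
    # Whatever the per-index components do not explain is 'unclaimed'.
    unclaimed = total_gain - sum(ds_gains.values()) - sum(im_gains.values())
    share = unclaimed / len(participating) if participating else 0.0
    return {i: share for i in participating}
```

Because maintenance gains are usually negative, subtracting them can make the unclaimed residual larger than the raw difference between total and scan gains, which is the intended accounting.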
The gain for a workload relates to the gains for individual queries by:

G_t(i, w_t, s_t) = Σ_{q∈w_t} [ G^ds_t(i, {q}, s_t) + G^im_t(i, {q}, s_t) + G^un_t(i, {q}, s_t) ].

By this definition, the gain G_t(i, w_t, s_t) will be 0 if i is not used by the optimiser in the current round t, and can be negative if the index creation leads to a performance regression or if the index incurs a maintenance cost. The creation time of i is taken as a negative reward only if i is materialised in round t, and is 0 otherwise:

C^cre_t(i) = C_cre(i) if i ∈ s_t \ s_{t−1}, and C^cre_t(i) = 0 otherwise.

Minimising the end-to-end workload time, or rather maximising the end-to-end workload time gained, is the goal of the bandit. As defined earlier, the total workload time C_tot is the sum of execution, recommendation and creation times accumulated over rounds. As such, minimising each round's summand is an equivalent problem. Modifying the execution time to the time gain, while ignoring the recommendation time, yields the per-round super arm reward:

R_t(s_t) = Σ_{i∈s_t} [ G_t(i, w_t, s_t) − C^cre_t(i) ].

Selection of the execution plan depends on the query optimiser, and, as noted, the query optimiser may settle on a sub-optimal query plan. As we show, the bandit is nonetheless resilient, as it can quickly recover from any such performance regressions. Observed execution times encapsulate real-world effects, e.g., the interaction between queries, application properties, run-time parameters, etc. The end-to-end workload time thus enters the reward through both the index creation and query execution times.

Algorithm 2: MAB Simulation for Index Tuning

A greedy oracle for super-arm selection. Recall that C²UCB leverages a near-optimal oracle to select a super arm based on individual arm scores [58]. As a sum of individual arm rewards, our super-arm reward has a (sub)modular objective function and (as is easily proven) exhibits monotonicity and Lipschitz continuity. Approximate solutions that maximise submodular (diminishing-returns) objective functions can be obtained with greedy oracles that are efficient and near-optimal [54]. Our implementation uses such an oracle combined with filtering to encourage diversity.
Initially, arms with negative scores are pruned. Then arm selection and filtering steps alternate until the memory budget is exhausted. In the selection step, an arm is selected greedily based on individual scores. The filtering step filters out arms that are no longer viable under the remaining memory budget, or that are already covered by the selected arms based on prefix matching. If a covering index is selected for a query, all other arms generated for that query are filtered out. Note that filtering is a temporary process that only impacts the current round.
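A compact sketch of this selection-and-filtering loop. The `covers` predicate, standing in for the prefix-based coverage check, is a hypothetical helper, as are the other names.

```python
def greedy_oracle(scores, sizes, budget, covers):
    """Greedy super-arm selection with filtering: repeatedly take the
    highest-scoring viable arm, then drop arms that no longer fit the
    remaining memory budget or are covered by an already-selected arm."""
    # Prune arms with non-positive scores or that never fit the budget.
    viable = {i for i, sc in scores.items() if sc > 0 and sizes[i] <= budget}
    chosen, remaining = [], budget
    while viable:
        best = max(viable, key=lambda i: scores[i])   # greedy selection step
        chosen.append(best)
        remaining -= sizes[best]
        # Filtering step: enforce remaining budget and coverage.
        viable = {i for i in viable - {best}
                  if sizes[i] <= remaining and not covers(best, i)}
    return chosen
```

Because the objective is (sub)modular and monotone, this style of greedy selection attains a constant-factor approximation guarantee [54], which is exactly the α that the α-regret analysis requires.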
Bandit learning algorithm. Algorithm 2 shows the MAB algorithm, which wraps Algorithm 1 and handles the domain-specific aspects of the implementation. We have divided Algorithm 1 into three main parts: initialisation (lines 1-2), arm recommendation (lines 4-10) and weight vector update (lines 12-13). These segments are utilised in Algorithm 2 as C²UCB function calls. After initialising the bandit, Algorithm 2 summarises workload information using templates; these track the frequency, average selectivity, and first-seen and last-seen times of queries, which help to generate the best set of arms per round (i.e., QoI). The context is updated after each round based on the workload and the selected set of arms. The bandit then selects the round's set of arms, forming a configuration to be materialised within the database. The reward is then calculated based on observed execution statistics on a new set of queries, and is used to update the shared weights. To support shifting workloads, where users' interests change over time, the learner may forget learned knowledge depending on the workload shift intensity (i.e., the number of newly introduced query templates).
In our implementation, we perform bandit updates separately for the creation time reward and the execution time reward (line 13 of Algorithm 1). For the creation cost update, we temporarily set all context features to 0, except for the context feature responsible for the index size. This can be viewed as an innovation of independent interest, where we decompose the reward into multiple components and direct each reward component's feedback to a subset of the features. We term this a focused update. This idea invites a new flavour of bandits, elaborated in the next section.

Contextual Combinatorial Bandit with Structured Rewards
When rewards can be decomposed into component rewards, we hypothesise that a focused update can result in faster convergence under two key conditions: (i) when each reward component is directly related to a small subset of context features, we create lower-dimensional supervised learning problems; and (ii) when each reward component is directly observed, we gain more opportunities for bandit feedback. Under focused updates, we use each component of the reward to learn part of the weight vector (see Figure 2). Indeed, for this structured setting we modify the proof of C²UCB to arrive at tighter regret bounds by a factor of 1/√n_f, where n_f is the number of reward components (i.e., the observed number of examples per arm).
Our approach to structured rewards is by a reduction to C²UCB. We modify C²UCB's formulation as if two examples are observed for each pulled arm i in round t, with respective rewards r_{t,1}(i) and r_{t,2}(i). Throughout both (sub-round) observations, the overall arm reward is fixed as r_t(i) = r_{t,1}(i) + r_{t,2}(i). This permits learning at a faster rate, while still coordinating an overall arm reward estimate.
Example 2 In the setting of this paper, we are motivated by the desire for context part 1 to be completely responsible for execution cost gains; we can then learn the negative weight from index creation cost directly into part 2's index size feature. This allows us to switch off the creation cost overhead for already-created arms by simply setting the index size context feature to zero. This ability to use domain knowledge to target reward feedback to a subset of context features is a powerful benefit of structured rewards.

Fig. 2: Regular contextual updates vs focused update.
Formally, we decompose each arm's context into two sub-contexts, x_t(i) = x_{t,1}(i) + x_{t,2}(i), with component rewards

r_{t,f}(i) = θ′ x_{t,f}(i) + ε_{t,f}(i),   f ∈ {1, 2},

for two independent zero-mean (subgaussian) noise random variables ε_{t,1} and ε_{t,2} with equal variance.² To see the benefit of the focused update, we further assume that ε_{t,1} and ε_{t,2} are each R/√2-subgaussian, which makes ε_t R-subgaussian.³ That the overall context is the sum of the sub-contexts does not change the role of the overall context for its respective arm, ensuring that the UCB in Equation (1) remains the same. We further assume complementary sparse sub-contexts: the overall context is the concatenation of the two sub-contexts, each being zero outside its own block of coordinates. To maintain valid learning given two observations per round with equal variances, at the end of round t we generalise our V_t and b_t updates to:

V_t = V_{t−1} + Σ_{i∈s_t} Σ_{f∈{1,2}} x_{t,f}(i) x_{t,f}(i)′,
b_t = b_{t−1} + Σ_{i∈s_t} Σ_{f∈{1,2}} r_{t,f}(i) x_{t,f}(i).

Notice that this differs from C²UCB's definition of V_t and b_t, and hence a new regret analysis is warranted.
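The generalised update amounts to one rank-one update per reward component, treating each (sub-context, component reward) pair as its own observation. A minimal numpy sketch under these assumptions (an illustrative rendering, not the paper's code):

```python
import numpy as np

def focused_update(V, b, sub_contexts, sub_rewards):
    """Structured-reward update: each reward component, paired with its
    sparse sub-context, contributes its own rank-one update, as if two
    examples were observed for the pulled arm in this round."""
    for x_f, r_f in zip(sub_contexts, sub_rewards):
        V = V + np.outer(x_f, x_f)
        b = b + r_f * x_f
    return V, b
```

Since the sub-contexts are complementary and sparse, each component reward only moves the coordinates of θ̂ in its own block, which is precisely how creation cost feedback can be directed at the index size feature alone.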
² Equal variances are assumed without loss of generality. Were the variances to differ, we would obtain different expressions for V_t and b_t, weighting the data with less variance more heavily. The value of γ² stays the same, serving as the variance of the prior on θ.
³ For independent R/√2-subgaussian random variables, their sum is R-subgaussian.
We exploit the fact that Theorem 4.2 in [58] is true regardless of the super arm s_t. Therefore, solely for the purpose of modifying the aforementioned theorem, we re-index the contexts and rewards such that x_{t,f}(i) = x̃_t(i + kf) and r_{t,f}(i) = r̃_t(i + kf), and we construct the effective super arm s̃_t = {i′ : i′ = i + kf, i ∈ s_t, f ∈ {0, 1}}. As such, our definitions of V_t and b_t can now be rewritten as:

V_t = V_{t−1} + Σ_{i′∈s̃_t} x̃_t(i′) x̃_t(i′)′,   b_t = b_{t−1} + Σ_{i′∈s̃_t} r̃_t(i′) x̃_t(i′),

which is syntactically the same as the definition given in [58]. We need to be more careful in concluding the theorem, however, since it contains an intermediate step involving det(V_t) as defined in [1]. Assuming that ‖x_t(i)‖ ≤ 1, we bound det(V_t) using the AM-GM inequality, together with the fact that the overall context is the concatenation of the first and second sub-contexts (so that ‖x_{t,1}(i)‖² + ‖x_{t,2}(i)‖² = ‖x_t(i)‖² ≤ 1). The resulting bound, which holds with probability at least 1 − δ, is the same as that in [58], with the exception of the definition of V_t. Conveniently, Lemma 4.1 from [58] is written in terms of V_{t−1} and α_t, so its proof follows exactly besides the choice of α_t, whose confidence bound holds simultaneously for all t ≥ 0 and i ∈ [k] with probability at least 1 − δ.
We have provided the correction of the proof of Lemma 4.2 from [58] in Appendix A. That proof can be reused by redefining the matrix X_T to stack the 2|s_T| re-indexed contexts; the rest of the proof follows very similarly, with a slight difference in the dimension of X_T, changing from |s_T| to 2|s_T|, and the third-to-last equality additionally requires computing the trace of the matrix of interest. Therefore, we have our modification of Lemma 4.2 of [58] as follows: Lemma 2 Let V ∈ R^{d×d} be a positive definite matrix and s_t ⊆ {1, ..., k}, with |s_t| bounded, for t = 1, 2, .... Since there is no modification of the objective function, and since all the theorems and lemmas required to arrive at the final regret bound are otherwise unchanged, the regret bound for the modified C²UCB stays the same as stated in [58]. Notice that in cases where there are n_f examples per arm in each round instead of only two, the regret bound generalises, gaining a factor of 1/√n_f, n_f ∈ N, compared to the original C²UCB where n_f = 1.

Experimental Methodology
We evaluate our MAB framework across a range of widely used analytical and HTAP industrial benchmarks, comparing it to a state-of-the-art physical design tool shipped with a commercial database product, referred to here as the Physical Design Tool (PDTool). This is a mature product, proven to outperform other physical design tools available on the market [9,40]. We also test our framework against a highly tuned RL agent, a popular technique adopted in recent database tuning research [62,8,31,56].
Benchmarks. We use five publicly available analytical processing benchmarks: TPC-H (with uniform distribution) [64] and TPC-H Skew [50] with Zipfian factor 4, allowing the reader to understand the impact of data skew when all other aspects are kept identical; TPC-DS [51], which probes solution fitness under a large number of candidate configurations; SSB [53], with easily achievable high index benefits; and finally, the Join Order Benchmark (JOB) over the real-world IMDb dataset [44] (henceforth referred to as IMDb), a challenging workload for index recommendation, where index overuse leads to performance regressions.
For HTAP performance testing we use the CH-BenCHmark [19,23], TPC-H, and TPC-H Skew benchmarks. CH-BenCHmark provides a complex mixed workload combining the TPC-C and TPC-H benchmarks. Its schema is composed of the unmodified TPC-C schema plus three tables (Supplier, Region, Nation) from TPC-H. Its workload combines the TPC-C [63] transactional workload with the 22 TPC-H [64] queries adapted to the CH-BenCHmark schema.
While CH-BenCHmark provides a uniform dataset, we are not aware of any HTAP benchmark with skewed data generation. While there are OLTP benchmarks with skewed datasets [23], they do not provide the OLAP complexity required to make the index selection problem interesting. Due to these limitations in existing benchmarks, we decided to extend the TPC-H Skew benchmark with INSERT, DELETE and UPDATE queries to mimic a skewed HTAP benchmark [57]. The data generation tool for TPC-H Skew already provides the functionality to generate data for inserts and deletes. In our extension, we additionally perform updates on existing records using the data generated for the new inserts. To highlight the impact of skewness on overall performance, we report comparable HTAP results on the TPC-H database as well.
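The update-generation extension described above can be sketched as follows. This is a minimal illustration: the row format, the updated columns, and the pairing with existing order keys are assumptions, since the text only states that the generated insert data is reused to update existing records.

```python
# Sketch: derive UPDATE statements for existing ORDERS rows from tuples
# the TPC-H Skew generator produced for inserts. The choice of columns
# (O_TOTALPRICE, O_ORDERSTATUS) and the pairing with existing keys are
# illustrative assumptions, not the authors' actual tool.

def updates_from_inserts(insert_rows, existing_keys):
    """Reuse each generated insert row's values as the new values of
    an UPDATE against an existing order key."""
    stmts = []
    for row, key in zip(insert_rows, existing_keys):
        stmts.append(
            "UPDATE ORDERS SET O_TOTALPRICE = {p}, O_ORDERSTATUS = '{s}' "
            "WHERE O_ORDERKEY = {k};".format(
                p=row["O_TOTALPRICE"], s=row["O_ORDERSTATUS"], k=key))
    return stmts

stmts = updates_from_inserts(
    [{"O_TOTALPRICE": 1234.5, "O_ORDERSTATUS": "F"}], existing_keys=[42])
print(stmts[0])
```

The same pattern would apply to LINEITEM, pairing generated tuples with existing (L_ORDERKEY, L_LINENUMBER) keys.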
Unless stated otherwise, all experiments use scale factor (SF) 10, resulting in approximately 10GB of data per workload, except for the IMDb dataset, which has a fixed size of 6GB. We consider three broad types of workloads (static, dynamic shifting, and dynamic random), allowing us to compare different aspects of the recommendation process. Across experiments, each group of templatized queries is invoked over rounds, producing different query instances. For static and dynamic settings, PDTool is invoked immediately after the first round of new queries, with those queries given as the training workload, since this workload is representative of future rounds. This setting is somewhat unrealistic and favours PDTool, since in real life PDTool will seldom truly know the representative workload (i.e., what is yet to arrive in the future). However, it presents a viable comparison against the workload-oblivious MAB, which does not use any workload information ahead of time, but instead observes the workload sequence and reacts accordingly.
Physical design tuning parameters. Both PDTool and MAB are given a memory budget approximately equal to the size of the data (1x; 10GB for SF 10 datasets and 6GB for IMDb) for the creation of secondary indices. We have experimented with memory budgets ranging from 0.25x to 2x (the benefits of additional memory diminish beyond 2x) under the TPC-H and TPC-H Skew benchmarks, and observed the same patterns throughout that range. We have therefore picked the middle of the active region (1x) as our default memory budget. All workloads come with their original primary and foreign keys, which influence the choice of indices; we grant the aforementioned memory budget on top of these. In search of the best possible design, we do not constrain the running time of PDTool, with one exception: in TPC-DS dynamic random, PDTool was uncompetitive due to long running times, hence the running time of each invocation was restricted to 1 hour. All proposed indices are materialised and queries invoked over the same commercial DBMS in both cases (MAB and PDTool).
Metrics. In addition to reporting total end-to-end workload time for all rounds, we also report the total workload time per round, used to demonstrate the convergence of different tools. Additionally, we present the total workload time broken down into recommendation time (invoking PDTool or the MAB framework), index creation time, and workload execution time. For completeness, we show original query times without any secondary indices (denoted NoIndex). In addition to convergence graphs of individual benchmarks, we present a summary graph with total end-to-end workload time for all rounds under MAB and PDTool tuning of the SSB, TPC-H (uniform), TPC-H Skew, TPC-DS and IMDb benchmarks.
Hardware. All experiments are performed on a server equipped with 2x 24-core Xeon Platinum 8260 CPUs at 2.4GHz, 1.1TB RAM, and 50TB of disk (10K RPM), running Windows Server 2016. We report cold runs, clearing database buffer caches prior to every query execution.

Experimental Results
In this section, we report on wide-ranging empirical comparisons of MAB and PDTool under both analytical and HTAP workloads. Finally, we present results against a well-tuned reinforcement learning agent.

MAB vs PDTool Under Analytical Workloads
Static workloads. Static workloads over uniform datasets are the best case for offline physical design tools, as a pre-determined workload sequence may perfectly represent future queries. However, when the underlying data is skewed, recommendations based on a pre-determined workload alone can have unfavourable outcomes. While still used for reporting, static workloads do not reflect modern dynamic workloads (e.g., data exploration) [33]. In static workloads, all query templates in the benchmark (22, 13, 99 and 33 templates for TPC-H, SSB, TPC-DS and IMDb, respectively) are invoked once every round, each with a different query instance of the template, for a total of 25 rounds, providing sufficient time for convergence.
Figure 4 displays the overall workload time (including recommendation and index creation time) for all 25 rounds under MAB and PDTool. For skewed datasets (TPC-H Skew, TPC-DS and IMDb), MAB outperforms PDTool, showing over 17%, 28% and 11% performance gains under the TPC-H Skew, TPC-DS and IMDb benchmarks, respectively. Under uniform datasets (TPC-H and SSB), both MAB and PDTool provide significant performance gains over NoIndex (over 50% and 85%, respectively), while PDTool outperforms MAB (by 19% and 5%). This is not surprising, since fully analytical, uniform, static experiments align with PDTool's assumptions: the future can be perfectly represented by a pre-determined workload. Convergence plots in Figure 3(a-e) show MAB's gradual improvement over the 25 rounds. Both MAB and PDTool show large spikes after the first round for all workloads, due to the recommendation and creation of indices. However, MAB might drop proposed indices and create new ones later on, generating relatively smaller spikes in subsequent rounds. Nonetheless, MAB efficiently balances the exploration of new indices, reducing exploration over time.
What is the best search strategy? Comparing execution times in the final round of the static experiment gives a clear idea of the benefit of execution-cost-guided search. As evident from Figure 3(a-e), in 4 out of 5 cases MAB converges to a better configuration than PDTool, providing over 5%, 84%, 31% and 19% better execution time by the last round (25th) under SSB, TPC-H Skew, TPC-DS and IMDb, respectively.
For TPC-H Skew, PDTool misses a vital index on Orders.O_custkey. This index boosts the performance of some queries (Q22 in particular), which MAB correctly detects and materialises. Missing this index leads to large execution times in several rounds (8, 12, 17, 20 and 25), including the last, for PDTool. These experiments illustrate the risk of relying on the query optimiser and imperfect statistics as a single source of truth.
The only case where MAB is outperformed by PDTool is under TPC-H (PDTool delivers over 21% better execution time by the last round). Different indices are proposed, as our current MAB framework does not support the index merging phase employed by some physical design tools [15]; instead, MAB uses individual queries to propose index candidates. We plan to address index merging in future work.
Dynamic shifting workloads. Under the dynamic shifting workloads, all query templates in the benchmark are randomly divided into 4 equal-sized groups. A group of query templates is then executed for 20 rounds, after which the workload switches to a new group of unseen queries (no overlap with the previous queries). When the workload switches, PDTool is invoked and trained on the new sequence of queries (whose templates will be used in the next 19 rounds). Thus, PDTool is invoked four times in total (in rounds 2, 22, 42, 62). The MAB framework, on the other hand, does not assume any workload knowledge.
Interestingly, NoIndex performs better than PDTool on the IMDb workload: PDTool has a higher total workload time as well as a higher execution time, with NoIndex providing a 3.5% (24 seconds) speed-up in execution time over PDTool. This is mainly due to misestimates by the optimiser [22]. As one example out of many, query 18 takes less than 1 second under NoIndex, whereas with the indices created by PDTool some instances of this query take around 7-8 seconds due to a suboptimal plan favouring index usage. This affects both MAB and PDTool, but MAB identifies the indices with a negative impact based on the reward and drops them. For the IMDb workload, which does not get much support from indices, MAB provides a 3% total performance gain and a 26% execution time gain compared to NoIndex.
One can easily observe the workload shifts in Figure 5(a-e) from the spikes in rounds 2, 22, 42, and 62. For PDTool, these are due to its invocation and index creation after each workload shift. Similar spikes can be seen in the MAB line, with automatic detection of workload shifts. Further random spikes can be observed for PDTool, from rounds 20-40 in TPC-H Skew and rounds 30-40 under IMDb, due to the issues discussed in the previous paragraphs (Q22 in TPC-H Skew, Q18 in IMDb).
Dynamic random workloads. We simulate modern data analytics workloads that are truly ad-hoc in nature. For instance, cloud providers hosting millions of databases can seldom detect representative queries, since they change frequently [22]. In such cases, it is common to invoke PDTool periodically (e.g., nightly or weekly), with the training workload taken to be the queries observed since the last invocation. In this setting, we invoke PDTool every 4 rounds, using queries from the last 4 rounds as the representative workload. The total number of training queries in the complete sequence is similar to the number of queries in the static setting; however, we have no control over the selection of queries, which are chosen completely at random. The sequence is then divided into 25 equal-sized rounds. In all cases, the round-to-round repeat workload was between 45% and 54%.
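The round-to-round repeat rate quoted above can be computed as the fraction of the current round's queries whose template already appeared in the previous round; a minimal sketch (the template identifiers and rounds are made up for illustration):

```python
def repeat_rate(prev_round, curr_round):
    """Fraction of the current round's queries whose template
    appeared in the previous round."""
    prev_templates = set(prev_round)
    hits = sum(1 for t in curr_round if t in prev_templates)
    return hits / len(curr_round)

# Hypothetical rounds of template IDs: half of round 2 repeats round 1.
r1 = ["Q1", "Q5", "Q7", "Q9", "Q12", "Q18", "Q20", "Q22"]
r2 = ["Q1", "Q5", "Q7", "Q9", "Q2", "Q3", "Q4", "Q6"]
print(repeat_rate(r1, r2))  # 0.5
```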
As shown in Figure 8, we again see a considerably lower total workload time for MAB compared to PDTool: over 37%, 17%, 11%, 61% and 75% speed-up under SSB, TPC-H, TPC-H Skew, TPC-DS and IMDb, respectively. Notably, in Figure 8 the total workload time of PDTool climbs higher than NoIndex on two occasions, TPC-DS and IMDb. In IMDb, this is due to the same issue discussed under dynamic shifting workloads (the optimiser's misestimates favouring the usage of sub-optimal indices, e.g., IMDb Q18). While PDTool has a much better execution time than NoIndex under TPC-DS (5.3h vs 9.3h), its high recommendation time (5.1 hours, see Table 1) leaves it with a higher total workload time. Under these two benchmarks (TPC-DS and IMDb), MAB provides over 55% and 1.5% performance gain over NoIndex, respectively. In Figure 7(a-e), we can see five major spikes for PDTool due to the tuning invocations (in rounds 5, 9, 13, 17, 21).

The Impact of Database Size
To examine the impact of database size, we run the TPC-H uniform and TPC-H Skew static experiments on SF 1, 10 and 100 databases. As previously discussed, under SF 10 MAB performed better on TPC-H Skew and PDTool performed better on TPC-H (see Table 2). The impact of sub-optimal index choices is even more evident for larger databases, leading to a huge gap between the total workload times of MAB and PDTool for TPC-H Skew (20 hours for the former vs 44 hours for the latter). In TPC-H, PDTool now results in a higher total workload time (14.8 hours vs 13.2 hours for MAB). This is mainly due to sub-optimal optimiser decisions, where the optimiser favours the usage of indices (coupled with nested-loops joins) when alternative plans would be a better option. For instance, under the indices recommended by PDTool, some instances of Q5 run longer than 8 minutes (using an index nested-loops join), while others finish in 1.5 minutes (using a plan based on hash joins). With larger database sizes, execution time dominates, contributing more than 91% of the total workload time. We observe faster and more accurate convergence of MAB under larger databases, due to a clearer difference between the rewards of different arms, highlighting MAB's excellent potential for larger databases.

Hypothetical Index Creation vs Actual Index Creation
Managing the exploration-exploitation balance under a large number of candidate indices, with an enormous number of combinatorial choices, is non-trivial. PDTool explores using "what-if" analysis, which counts towards the tool's recommendation time, whereas MAB explores using actual index creations.
Cost of hypothetical index creation: When analysing PDTool's average invocation times in the dynamic shifting (small workloads) and dynamic random (large workloads) settings, it becomes evident that PDTool invocation cost grows noticeably with the training workload size under all benchmarks (see Table 1). As an example, PDTool tuning of the TPC-DS benchmark grows from 3 minutes in the dynamic shifting setting (25-query workload) to 1 hour in the dynamic random setting (400-query workload). Furthermore, the multiple invocations required in the dynamic random and shifting settings aggravate the problem further for PDTool (see Table 1). PDTool recommendation time also increases rapidly with workload complexity: in an experiment with 100-query workloads from SSB, TPC-H and TPC-DS (with complexity SSB < TPC-H < TPC-DS), the complexity of a workload has a considerable impact on PDTool recommendation time (see Table 3).
MAB recommendation times stay significantly lower and stable despite workload shifts and changes in complexity or size (see Tables 1 and 3). In all experiments, MAB spends less than 1% of the total workload time on recommendation, except for IMDb, where it spends around 2% (due to the low total workload time and the high number of query templates). More than 80% of this recommendation time is spent on the initial setup (1st round), and the continuous overhead is negligible.
Cost of actual index creation: While search based on actual execution statistics allows MAB to converge to better configurations, as a downside MAB spends more time on index creation (see Table 1). For instance, under the TPC-H and TPC-H Skew static experiments, MAB spends 5.6 and 19.8 minutes on index creation, where PDTool spends only 2.4 and 8.3 minutes, respectively. Under skewed data, rewards show more variability, which delays MAB's convergence and leads to higher exploration and greater creation costs. While MAB is still competitive due to efficient exploration, we will consider ways to improve its convergence in future work.
Final verdict: Comparing the total of recommendation and index creation times (henceforth referred to as exploration cost) between MAB and PDTool presents a clear picture of these two exploration methods. From Table 1 we can observe that in most cases (9 out of 15) MAB achieves a better exploration cost than PDTool when running analytical workloads. However, when the workload is small (e.g., dynamic shifting), PDTool tends to perform better. TPC-DS, with the highest number of candidate indices among these benchmarks (over 3200), provides a great test case for exploration efficiency: under TPC-DS, MAB's exploration cost is significantly lower in the shifting and random settings, and only marginally higher in the static setting. Despite the efficient exploration, MAB does not sacrifice recommendation quality in any way (better execution costs in 12 out of 15 cases, with significantly better execution costs under all cases of TPC-DS). This efficient exploration is promoted by the linear reward-context relationship along with C2UCB's weight sharing (Section 3), resulting in a small number of parameters to learn. An arm's identity becomes irrelevant and its context (Section 4) becomes the sole determining factor of the arm's expected score, which allows MAB to predict the UCB of a newly arriving arm with known context without trying it even once.
The impact of round size: All three round sizes converge to the same performant configurations by the last round. We observe faster convergence with small round sizes, resulting in lower execution costs in the first few rounds. While the execution cost gain from 1x to 0.5x is noticeable, dividing the round further (down to a single query) does not provide considerable benefit compared to the added creation and recommendation overhead. With larger round sizes, we observe lower creation costs due to less frequent bandit updates (see Table 4). MAB performs better than PDTool under all round sizes. A DBA can decide on the round size (bandit invocation frequency) based on the application and the DBA's primary goal (faster convergence vs low creation cost). We leave auto-tuning of this parameter as an interesting avenue for future work.
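The weight sharing described above can be sketched with a LinUCB-style shared ridge-regression model, in the spirit of C2UCB: a single θ is fit from all arms' (context, reward) observations, so the UCB of a never-materialised index follows from its context alone. The features, data, and α below are illustrative, not the framework's actual values.

```python
import numpy as np

def fit_theta(contexts, rewards, lam=1.0):
    """Shared ridge-regression estimate: V = lam*I + sum x x^T, b = sum r x."""
    d = contexts.shape[1]
    V = lam * np.eye(d) + contexts.T @ contexts
    b = contexts.T @ rewards
    return np.linalg.solve(V, b), V

def ucb(x, theta, V, alpha=1.0):
    """Optimistic score: mean estimate plus exploration bonus."""
    return x @ theta + alpha * np.sqrt(x @ np.linalg.solve(V, x))

# Observations from previously materialised indices (d = 2 features each).
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 0.9]])
r = np.array([1.0, 0.9, 0.1])
theta, V = fit_theta(X, r)

# A brand-new arm is scored purely from its context, without being tried.
x_new = np.array([0.95, 0.15])
print(ucb(x_new, theta, V))
```

Because θ and V are shared, every observation refines the scores of all arms, including unseen ones, which is what keeps the number of parameters small.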

The Impact of Data Skew in Analytical Workloads
While PDTool outperforms MAB on TPC-H (uniform) by 19% in total workload time, MAB outperforms PDTool by 17% on TPC-H Skew with Zipfian factor 4.
To further investigate the impact of the degree of data skew, we experiment with Zipfian factors ranging from 1 to 3. As shown in Figure 9, under Zipfian factors 2 and 3 MAB shows over 51% and 58% performance gain against PDTool, respectively, whereas under Zipfian factor 1, PDTool outperforms MAB by 16%. PDTool missing the index on Orders.O_custkey appears to be more costly with Zipfian factors 2 and 3, mainly affecting Q22.
To explain further: in the Zipfian distribution, the frequency of an item is inversely proportional to its rank in the frequency-ordered list. For example, with z=4, the most frequent customer in the Orders table is repeated 86k times, whereas the second most frequent customer is repeated only 17k times (an 80% drop). Similarly, for z=2, the frequency decreases from 227k to 101k (a 46% drop). The percentage drop in frequency between adjacent ranks increases with the Zipfian factor. We observed that the running time of Q22 depends more on the frequency of the most frequent customer in the Orders table (which is 50k, 227k, 156k and 86k for z values 1 to 4, respectively) than on the skewness itself, which explains the results shown in Figure 9.

Optimal Configurations
With many possible configurations for even the smallest databases, finding an optimal configuration under a given memory budget, while considering performance regressions, is a non-trivial task. However, when the memory budget limitation is lifted, we can estimate optimal configurations with a set of covering indices. This configuration occupies around 40GB of space for both TPC-H and TPC-H Skew. Under this budget, both PDTool and MAB converged to the optimal configuration by the final round.

MAB vs PDTool Under HTAP Workloads
In this section, we demonstrate MAB's ability to efficiently tune indices under HTAP workloads, which are well known for their complexity.
Static workloads. In Section 7.1, we observed better performance from PDTool under fully analytical static workloads on uniform datasets (TPC-H and SSB), which serve as the best case for offline tuning tools such as PDTool. In this section, we show that, even under repeating workloads on uniform datasets, the addition of transactional queries can create problems for PDTool. To illustrate the impact of transactional queries, we use the CH-BenCHmark. The transactional component of the workload is composed of 5 transactions (new-order, payment, order-status, delivery and stock-level) with a pre-specified transaction mixture (44%, 44%, 4%, 4% and 4%, respectively). The smallest transactional workload adhering to the specified TPC-C transaction mixture comprises 11 new-order transactions, 11 payment transactions, an order-status transaction, a delivery transaction and a stock-level transaction (approximately 650 queries); it is henceforth referred to as a set of transactional queries. We define the transactional-to-analytical ratio (henceforth TAR) as the ratio between transactional and analytical query sets. As an example, a 5:1 TAR is composed of one analytical query set (the 22 TPC-H queries) and 5 transactional query sets (i.e., 55 new-order transactions, 55 payment transactions, 5 order-status transactions, 5 delivery transactions and 5 stock-level transactions, resulting in approximately 3300 transactional queries per round).
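The arithmetic behind these workload sizes follows directly from the definitions above; a small sketch (the ~650 queries-per-set figure is the approximation given in the text):

```python
# One transactional "set" = the smallest mix honouring TPC-C's
# 44/44/4/4/4% transaction mixture: 11 new-order, 11 payment,
# 1 order-status, 1 delivery, 1 stock-level (~650 queries, per the text).
SET_MIX = {"new-order": 11, "payment": 11, "order-status": 1,
           "delivery": 1, "stock-level": 1}
QUERIES_PER_SET = 650      # approximate, from the text
ANALYTICAL_SET = 22        # one analytical set = the 22 TPC-H queries

def workload_per_round(tar):
    """Transaction counts and approximate query totals for a tar:1 TAR."""
    txns = {name: tar * n for name, n in SET_MIX.items()}
    return txns, tar * QUERIES_PER_SET, ANALYTICAL_SET

txns, txn_queries, olap_queries = workload_per_round(5)
print(txns["new-order"], txn_queries)  # 55 3250 (~3300 in the text)
```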
As evident from Figure 10, in transaction-heavy workloads MAB performs much better than PDTool, providing up to 51% speed-up (5:1) in total workload time, whereas PDTool performs better in fully analytical workloads, providing up to 8% speed-up (0:1). This mirrors the results observed under the static setting with the TPC-H and SSB workloads, where the data is uniformly distributed and the workload is fully analytical. At 1:1 TAR, the first ratio that introduces the transactional component, MAB starts to take the lead, providing a 4% performance gain in total workload time. MAB manages to reach the same last-round execution time as PDTool; however, due to better execution times in the early rounds, PDTool provides a 7.7% total execution time speed-up over MAB. From 2:1 TAR onwards, MAB dominates PDTool, providing 26%, 47%, 51% and 51% total workload time speed-up under 2:1, 3:1, 4:1 and 5:1 TARs, respectively. PDTool struggles to perform better than NoIndex from 3:1 TAR onwards due to its heavy recommendation costs, yet it remains superior to NoIndex in execution cost.
To further understand these results, we dive into the 3:1 TAR experiment, which has enough transactional queries to demonstrate the importance of both analytical gain and transactional overhead. As shown in the convergence graphs in Figure 11, despite the high recommendation cost in the first round, PDTool provides much better per-round total workload times than NoIndex. However, MAB converges to an even better total workload time by the last round. As clearly visible from Figure 11(b), MAB converges to a better configuration, providing 29.4% better execution time by the last round compared to PDTool.
Balancing configuration fitness between transactional and analytical workloads is the prime concern of index tuning in HTAP environments. How both tools achieve this balance can be better understood by breaking the execution cost into analytical and transactional components. As Figure 12 demonstrates, the MAB configuration provides better execution time for both components (19.9% and 43.9% better execution times by the 25th round for the analytical and transactional workloads, respectively). In the initial rounds, MAB's transactional execution time is inferior to PDTool's, but it quickly learns the negative impact of indices on the transactional workload. By the 4th round, MAB surpasses PDTool in transactional execution time by dropping indices with negative rewards. Note that, while removing unnecessary indices, MAB makes sure not to impact the analytical execution times by keeping the high-reward indices intact. MAB performs several configuration changes in rounds 15-17, which results in a sudden oscillation in transactional execution time, but these changes allow the bandit to find a configuration superior in both analytical and transactional execution cost. There is some variability in transactional execution costs even with the same number of transactions in each round, as the number of queries per round can differ (e.g., different new orders can have different numbers of items, leading to different numbers of queries).

The Impact of Data Skew in HTAP Workloads
We now experiment with the TPC-H and TPC-H Skew HTAP workloads to demonstrate the impact of adding transactional queries to the well-known benchmarks we already examined in Section 7.1. We use a similar number of transactional queries as in the 3:1 TAR CH-BenCHmark experiment. The OLTP part of the workload is composed of 6 templates (two insert, two delete and two update templates). We use the original data generation tools to generate the insert and delete tuples for the ORDERS and LINEITEM tables. Additionally, we use the same tool's insert data to populate our update queries over the same tables. The transactional workload used here is simpler than CH-BenCHmark, but sufficient to demonstrate the impact HTAP workloads have on overall performance. As shown in Figure 13, MAB performs better on both the TPC-H and TPC-H Skew HTAP workloads. Interestingly, MAB achieves 5.2% better total workload time even under the TPC-H benchmark, where PDTool previously outperformed MAB by 19% in the fully analytical static setting. Comparing the final-round execution cost, MAB converges to a similarly performant configuration, whereas under the fully analytical setting there was a considerable gap between MAB and PDTool execution times even at the last round. However, due to MAB's longer execution times in the first few rounds, PDTool achieves a 6% better total execution time; due to its higher recommendation time, PDTool nevertheless ends up with a higher total workload time. In TPC-H Skew, MAB dominates PDTool on all fronts, with 28.4% better total workload time and 24% better execution time. All HTAP results are summarised in Table 5.

Dynamic HTAP Workloads
In this experiment, we gradually increase and then decrease the transactional workload size over the rounds. We start with 0:1 TAR, which is purely analytical, and add transactional workload sets one by one until we reach 5:1 TAR. Afterwards, we gradually remove the transactional workload sets one by one to reach 0:1 TAR again. We run 20 rounds at each TAR. Soon after each workload change, PDTool is invoked with the new workload from the previous round, which is a good representation of the next 19 rounds. It is essential to provide the workload from at least one complete round, as PDTool takes the transactional-to-analytical ratio into account when making recommendations. However, as observable from Figure 14, each invocation of PDTool takes a substantial amount of time for the larger workloads at higher transaction levels. We can observe that MAB performs most of its configuration changes at the start of the experiment and then after the first workload change (0:1 → 1:1). It is understandable that MAB needs to explore extensively at the start of the experiment; a similar level of exploration at the first workload change might seem less intuitive, given that we do not see such exploration for the rest of the experiment. At round 21, however, MAB is exposed to transactional queries for the first time in this experiment, so it performs more exploration and learns the negative impact of indices in these rounds. We can see MAB doing much better after round 40. MAB provides a 57.8% speed-up in total workload time compared to PDTool, a significant portion of which is attributable to MAB's lower recommendation cost.
To compare the configurations proposed by the two tools, we plot execution time over the rounds in Figure 15. After the first two transaction levels (i.e., after round 40), MAB always manages to lock into a better configuration, providing faster execution times and an 11% speed-up in total execution cost.
Over the entire experiment, we pass through each TAR twice (except 5:1), which results in similar workloads. However, as Figure 15 shows, PDTool reaches higher execution costs in the descending part of the experiment (after round 120) than in the ascending part (rounds 1 to 120). Configurations proposed by PDTool depend on the secondary indices already present in the system and on the underlying data. Continuous additions and deletions change the underlying data, somewhat invalidating the statistics used by the optimiser. Furthermore, at each PDTool invocation the system has a different starting set of secondary indices, which impacts PDTool's recommendations. Therefore, the recommendations proposed for similar workloads in the ascending and descending parts of the graph differ. For example, rounds 80-100 and 120-140 both run a 4:1 TAR workload; nonetheless, PDTool proposes two very different configurations for these two sections of the experiment, with only 18 indices proposed at round 81 versus 27 at round 121, and only 11 indices shared between the two configurations. MAB, on the other hand, converges quickly in the later rounds of the experiment, taking advantage of the knowledge it has already obtained.
To further analyse MAB's gain in the dynamic experiment, we break down the execution time into analytical and transactional components. As one can observe from Figures 16 and 17, MAB obtains its gain mainly from the transactional workload, providing 4.5% better analytical execution cost and 22.6% better transactional execution cost than PDTool. As observable from Figure 16, MAB obtains a noticeable analytical gain at 2:1 and 3:1 TARs. In the analytics-heavy workloads (0:1, 1:1 TARs), PDTool records a better or similar analytical execution time. For transaction-heavy workloads (4:1, 5:1 TARs), MAB opts for a transaction-friendly configuration, thereby reducing the analytical execution time gain. On the transactional end, MAB leads in almost all workloads, and as expected, the transactional execution cost gain increases for the transaction-heavy workloads. Again we see a clear increase in transactional workload execution times for PDTool in the descending part of the graph, due to the different configurations proposed by PDTool, as discussed above.

Space Savings Under HTAP Workloads
In index tuning of HTAP workloads, more indices can result in higher running times for transactional queries. Consequently, a minimal index set can be optimal for a transaction-heavy workload. While such a configuration might be suboptimal for the analytical component of the workload, a minimal configuration can result in a better total query execution time due to the significant savings from avoiding the index maintenance triggered by transactional queries. In our experiments, we observed that PDTool usually exploits the entire given memory budget, resulting in a higher transactional execution cost. On the flip side, MAB learns the negative impact of indices on transactional queries and dynamically adjusts the configuration. This behaviour was not visible when the workload was fully analytical and is only observable in transaction-heavy workloads. We notice that the MAB context helps the bandit to find smaller configurations with higher execution cost gains: the context includes the index size as a feature, which is typically learned with a negative weight due to the high negative rewards from index creation operations. This property forces the bandit to choose the smallest arms that provide the best gains in execution cost. At 5:1 TAR, MAB provides a configuration that yields an 83% memory saving while achieving an 8.8% execution time gain by the last round (see Figure 18). This execution cost gain is smaller than the gain observed at 3:1 TAR, as the usefulness of indices reduces when the workload becomes transaction-heavy.
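The effect of a size-penalised score can be sketched with a toy greedy selection under a memory budget; the scores, sizes, index names, and the greedy rule are illustrative, not the exact C2UCB oracle.

```python
def greedy_select(indices, budget):
    """Pick indices by score-per-GB until the memory budget is exhausted.
    Each candidate is (name, expected_score, size_gb); a learned negative
    weight on the size feature lowers the score of bulky indices, and
    arms with negative net reward (HTAP maintenance overhead) are skipped."""
    ranked = sorted(indices, key=lambda c: c[1] / c[2], reverse=True)
    chosen, used = [], 0.0
    for name, score, size in ranked:
        if score > 0 and used + size <= budget:
            chosen.append(name)
            used += size
    return chosen, used

# Hypothetical candidates (name, score, size in GB).
candidates = [
    ("ix_orders_custkey",   0.9, 1.0),   # small, high benefit
    ("ix_lineitem_wide",    1.0, 6.0),   # large covering index
    ("ix_part_brand",       0.3, 0.5),
    ("ix_supplier_nation", -0.2, 0.4),   # negative net reward under HTAP
]
chosen, used = greedy_select(candidates, budget=2.0)
print(chosen, used)  # ['ix_orders_custkey', 'ix_part_brand'] 1.5
```

Note how the large covering index is passed over and the negatively rewarded index is dropped entirely, yielding a small configuration well under budget.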
While it might be counter-intuitive for an index tuning tool to use less memory to provide better-performing configurations, this can be easily understood by observing the analytical and transactional execution times of both tools by the last round (see Figure 19). Comparing the last-round configurations of both tools, PDTool provides an analytics-friendly configuration that yields a 27% speed-up in analytical execution time (around a 40-second gain per round). On the other hand, MAB locks into a smaller configuration that is more suitable for transactional queries, providing a 60% speed-up in transactional execution cost (around a 60-second gain per round). Ultimately, MAB provides an 8.8% speed-up in total execution time even after a significant memory saving.
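As a back-of-envelope consistency check on these numbers (the per-round total is our inference, not a figure reported above): PDTool saves roughly 40 s of analytical time per round while MAB saves roughly 60 s of transactional time per round, so MAB's net advantage is about 20 s per round, which at an 8.8% total speed-up implies a per-round total in the low hundreds of seconds.

```python
# Per-round gains taken from the text; the implied total is derived, not reported.
analytical_gain_pdtool = 40.0   # s/round PDTool saves on analytical queries
transactional_gain_mab = 60.0   # s/round MAB saves on transactional queries
net_mab_gain = transactional_gain_mab - analytical_gain_pdtool  # 20 s/round

total_speedup = 0.088           # MAB's 8.8% total execution time gain
implied_round_total = net_mab_gain / total_speedup

print(net_mab_gain, round(implied_round_total))  # 20.0 227
```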

Why Not (General) Reinforcement Learning?
Past efforts have considered more general reinforcement learning (RL) for physical design tuning [62,8]. Compared to most MAB approaches, deep RL invites over-parameterisation, which can slow convergence (see Figure 20), whereas MAB typically provides faster convergence, simpler implementation, and safety guarantees via strategic exploration and knowledge transfer (see Section 3). Due to its randomisation, RL can also suffer from performance volatility compared to C²UCB, a deterministic algorithm.
Experimental setup. The above intuition is supported by experiments with more general RL, where we evaluate the popular DDQN agent [30]. We run the static 10GB TPC-H and TPC-H Skew analytical benchmarks over 100 rounds and present results in Figure 20. For a fair comparison, we combine the contexts of all MAB arms as the DDQN state, and present the same set of candidate indices to the DDQN. For the DDQN's neural-network hyperparameters, we follow the experiments of [62], using 4 hidden layers with 8 neurons each. The discount factor γ is set to 0.99, and the exploration parameter starts at 1 and decays exponentially towards 0, reaching 0.01 at the 2400th sample, where one sample corresponds to one index chosen by the agent. At the beginning of a round, if the agent decides to explore, the set of indices for that entire round is chosen randomly. These experiments are repeated ten times, reporting either the average (Figure 20 (a) and (b)) or the median with inter-quartile range ((c) and (d)). For completeness, we include the case of using only single-column indices (DDQN-SC in Figure 20), as originally proposed in [62].
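One concrete reading of this exploration schedule: ε decaying exponentially from 1 and reaching 0.01 at the 2400th sample implies a per-sample decay rate of ln(100)/2400. The sketch below is our interpretation of that setup, not code from [62].

```python
import math

EPS_START, EPS_AT_2400, DECAY_SAMPLES = 1.0, 0.01, 2400
# Rate such that EPS_START * exp(-RATE * 2400) == EPS_AT_2400.
RATE = math.log(EPS_START / EPS_AT_2400) / DECAY_SAMPLES

def epsilon(sample: int) -> float:
    """Exploration probability after `sample` index choices (decays towards 0)."""
    return EPS_START * math.exp(-RATE * sample)

print(epsilon(0), epsilon(1200), epsilon(2400))  # 1.0, 0.1, 0.01
```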
Evaluation. Due to DDQN-SC's reduced search space, in some scenarios it might not be possible to find an optimal configuration for the workload; this is evident from Figure 20. Interestingly, under TPC-H Skew, where the demand for exploration is higher, DDQN-SC has a lower total workload time than DDQN due to the noticeably smaller index creation times of single-column indices (1.5 hours vs. 5.8 hours, respectively). Under both TPC-H and TPC-H Skew, MAB performs significantly better, providing 35% and 58% speed-ups against the better RL alternative, respectively.
No state transitions. A strength of more general reinforcement learning is its ability to take into account (random) state transitions when actions are taken. However, the importance of state transitions in online index selection is unclear. While the modelled state could include the collection of indices that exist in the system, actions (i.e., choosing an index) deterministically govern the ensuing state. State could also model characteristics of the next round's queries; however, these queries do not depend on the prior action, so it is appropriate to take the successor query state as independent of the action, as bandits do. Hence, adopting more general RL provides no clear benefit over MAB, while imposing a delay to convergence as demonstrated in Figure 20 (c) and (d), and passing over MAB-style performance guarantees (see Section 3).
Hyperparameter search space. Deep RL is notorious for challenging hyperparameter tuning. For example, in this experiment, we have to decide: the number of layers of the neural network, the neurons per layer, the activation functions and loss, the exploration parameter ε, and the discount factor γ. C²UCB has just λ, which becomes less relevant as rounds are observed, and α, which controls exploration.
Volatility of deep RL. Most deep RL algorithms randomise to explore vast state-action spaces. This is not the case with C²UCB. Extending UCB, the deterministic C²UCB is capable of identifying under-explored arms through their context vectors. The only (rare, and as such not strictly necessary) case in which C²UCB is random is when the MAB must tie-break arms. A more significant cause of our MAB's stability is its small parametrisation compared to deep-learner-based RL. Combined, these properties yield a more consistent result, as can be seen in Figure 20 (a) and (b): the much wider variance of the DDQN plot shows how its performance can vary significantly, in contrast to the MAB's narrow error bars.
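For intuition, C²UCB-style deterministic arm scoring can be sketched as follows: each arm's score is its estimated reward θ̂ᵀx plus an α-scaled uncertainty width, and given identical inputs the argmax is deterministic; randomness would enter only to break exact ties. This is an illustrative simplification with made-up numbers, not the paper's implementation.

```python
import numpy as np

def c2ucb_scores(theta, V_inv, contexts, alpha):
    """UCB score per arm: estimated reward + alpha * uncertainty width."""
    scores = []
    for x in contexts:
        mean = float(theta @ x)                 # estimated reward
        width = float(np.sqrt(x @ V_inv @ x))   # confidence width ||x||_{V^-1}
        scores.append(mean + alpha * width)
    return np.array(scores)

theta = np.array([1.0, -0.5])                   # learned weights (illustrative)
V_inv = np.linalg.inv(2.0 * np.eye(2))          # ridge prior with lambda = 2
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]

scores = c2ucb_scores(theta, V_inv, arms, alpha=0.5)
best = int(np.argmax(scores))                   # deterministic given same inputs
print(scores, best)
```

Because the score is a deterministic function of θ̂, V, and the contexts, repeated runs over the same inputs select the same arms, which is the stability property contrasted with DDQN above.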
Related Work
HTAP. HTAP workloads are composed of online transaction processing (OLTP) workloads and online analytical processing (OLAP) workloads. While most existing analytical systems depend on data pipelines writing to a separate data warehouse for OLAP queries, such an approach prevents the user from running analytics on fresh operational data. Research has targeted hybrid environments that can cater to both OLTP and OLAP queries. The last few years have witnessed the emergence of HTAP-focused database architectures, platforms and databases [5,6,7,47,32,59,37], commercial tools [42,43,66], and benchmarks [19,23,18]. This rapid growth of research and commercial interest in HTAP environments highlights the pain point of efficiently processing analytical and transactional queries over the same dataset.
Automated physical design tuning. Most commercial DBMS vendors nowadays offer physical design tools in their products [3,68,21]. These tools rely heavily on the query optimiser to compare benefits of different design structures without materialisation [14]. Such an approach is ineffective when base data statistics are unavailable, skewed, or change dynamically [16]. In these dynamic environments, the problem of physical design is aggravated: a) deciding when to call a tuning process is not straightforward; and b) deciding what is a representative training workload is a challenge.
Online physical design tuning. Several research groups have recognised these problems and have offered lightweight solutions to physical design tuning [12,13,60,61]. While such solutions are more flexible and need not know the workload in advance, they are typically limited in their applicability to new unknown workloads (generalisation beyond the past), and do not come with theoretical guarantees that extend to actual runtime conditions. Moreover, by giving the optimiser a central role, the tools remain susceptible to its mistakes [24]. [22] extends [3] with additional components, in a scope narrowed to index selection, to mimic an online tool; it takes corrective actions against optimiser mistakes through a validation process.
Adaptive and learning indices. Another dimension of online physical design tuning is database cracking and adaptive indexing, which smooth the creation cost of indices by piggybacking on query execution [34,28]. Recent efforts have gone a step further and proposed replacing data structures with learned models that are smaller in size and faster to query [41,25,35]. Such approaches are complementary to our efforts: once the data structures (or models) are materialised inside a DBMS, the MAB framework can be used to automate the decision as to which data structure should be used to speed up query processing.
Learning approaches to optimisation and tuning. Recent years have witnessed new machine learning approaches to automating decision-making processes within databases. For instance, reinforcement learning approaches have been used for query optimisation and join ordering [36,65,39,49]. In [24], regression has been used to successfully mitigate the optimiser's cost misestimates as a path toward more robust index selection, with promising results in avoiding query regressions. However, this classifier adds up to 10% to recommendation time, impacting recommendation cost in all cases, especially where recommendation cost already dominates for the PDTool (e.g., TPC-DS, IMDb).
When it comes to tuning, the closest approaches employ variants of RL for index selection or partitioning [62,31,8], or for configuration tuning [56]. [8] describes RL-based index selection that depends solely on the recommendation tool for query-level recommendations and suffers from a combinatorial explosion of decisions, both issues addressed in our work. Unlike their more general counterpart (RL), MABs have the advantages of faster convergence, as demonstrated in Section 7.3, simpler implementation, and theoretical guarantees. There has also been recent interest in using bandits for database tasks such as monitoring, query optimisation, and join ordering [29,48,27].

Discussion and Future Avenues
This paper scratches the surface of the numerous opportunities for applying bandit learners to performance tuning of databases.We now discuss a rich research vision for the area.
Multi-tenant environments. A crucial advantage of the MAB setting is its theoretical guarantees on the fitness of proposed indices to observed run-time conditions. This is critical for production systems in the cloud and multi-tenant environments [24,52,22], where analytical modelling is impossible due to unpredictable changes in run-time conditions. The MAB approach, by contrast, eschews the optimiser and modelling completely, choosing indices based on observed query performance, and is thus equally applicable to these challenging environments.
Beyond index choices. Despite focusing solely on the task of index selection in this paper, the MAB framework is equally applicable to other physical design choices, such as materialised view selection, statistics collection, or even selection of design structures that mix traditional and approximate data structures, such as learned models [41] or other fine-grained design primitives [35]. Furthermore, MABs can be used in other database areas that require strategic exploration under theoretical guarantees, such as dynamic memory allocation, query optimisation, and database monitoring.
Cold-start problem. Under the current setup, MAB starts without any secondary indices or knowledge about their benefits, forming a cold-start problem and leading to higher creation costs. While MAB is already superior to PDTool even with the creation cost burden (see Section 7.1.2), even faster convergence and lower creation costs could be achieved by pretraining models in hypothetical rounds (using what-if calls) or by workload forecasting [46] to improve context quality.
Opportunities for bandit learning. The increased search space of possible design choices calls for advances in bandit algorithms and theory, where unbounded or infinite numbers of arms will be increasingly important. Similarly, various flavours of physical design may call for novel bandits that adopt the notion of heterogeneous arms (indices, views, or statistics), or hierarchical models where individual choices at lower levels (e.g., the choice of indices or materialised views) influence decisions at a higher level (e.g., index merging due to memory constraints).

Conclusions
This paper develops a multi-armed bandit learning framework for online index selection. Its benefits include eschewing the DBA and the (error-prone) query optimiser, instead learning the benefits of indices through strategic exploration and observation. We justify our choice of MAB over general reinforcement learning for online index tuning by comparing MAB against DDQN, a popular RL algorithm based on deep neural networks, demonstrating significantly faster convergence for the MAB. Furthermore, our extensive experimental evaluation demonstrates the advantages of MAB over an existing commercial physical design tool (up to 75% speed-up, and 23% on average), and exemplifies robustness to data skew, unpredictable ad-hoc workloads, and complex HTAP environments.

A Theoretical Regret Analysis
The original bound leverages Lemma 3. Unfortunately, the original proof of this lemma suffers from an error, as the following incorrect claim is made.
Claim. Let $k, T$ be natural numbers, $V$ a $d \times d$ real positive-definite matrix, and $s_t \subseteq \{1, \dots, k\}$ with $|s_t| \le k$ for $t \in \{1, \dots, T\}$. Let $x_t(i) \in \mathbb{R}^d$ be vectors for $t \in \{1, \dots, T\}$, $i \in \{1, \dots, k\}$, and define the matrix containing the sum of the outer products of the contexts to be $V_T = V + \sum_{t=1}^{T} \sum_{i \in s_t} x_t(i)\, x_t(i)^{\top}$. If we define $\lVert a \rVert_M = \sqrt{a^{\top} M a}$, then
\[
\det(V_T) = \det(V) \prod_{t=1}^{T} \Big( 1 + \sum_{i \in s_t} \lVert x_t(i) \rVert^2_{V_{t-1}^{-1}} \Big).
\]
Counterexamples to this claim are found by inserting arbitrary values of the variables when $T = 2$. We relax the claim's relation, as proven in [55], by instead proving the inequality
\[
\det(V_T) \ge \det(V) \prod_{t=1}^{T} \Big( 1 + \sum_{i \in s_t} \lVert x_t(i) \rVert^2_{V_{t-1}^{-1}} \Big).
\]
Fortunately, this still yields a proof of Lemma 3 as originally stated, with a non-trivial modification to its proof but no further changes needed in proving Theorem 1.
We begin our proof by collecting the contexts of the super arm into $X_T = \big[\, x_T(s_{(1,T)}) \;\cdots\; x_T(s_{(|s_T|,T)}) \,\big]$. Then
\[
\det(V_T) = \det\big( V_{T-1} + X_T X_T^{\top} \big) = \det(V_{T-1}) \det\big( I + X_T^{\top} V_{T-1}^{-1} X_T \big),
\]
where the equalities follow from the Generalised Matrix Determinant Lemma and the fact that adding the identity to a square matrix increases its eigenvalues by one. The final determinant can be expanded as
\[
\det\big( I + X_T^{\top} V_{T-1}^{-1} X_T \big) = \prod_j (1 + \lambda_j) = 1 + \sum_j \lambda_j + \cdots, \tag{2}
\]
where the $\lambda_j$ are the eigenvalues of $X_T^{\top} V_{T-1}^{-1} X_T$ and $\sum_j \lambda_j = \operatorname{tr}\big( X_T^{\top} V_{T-1}^{-1} X_T \big) = \sum_{i \in s_T} \lVert x_T(i) \rVert^2_{V_{T-1}^{-1}}$. Since $V$ is positive definite and $x_t(i)\, x_t(i)^{\top}$ is positive semi-definite (with one eigenvalue equal to $x_t(i)^{\top} x_t(i)$ and the remainder all zero) for all $t$ and $i$, the matrix $V_{T-1} = V + \sum_{t=1}^{T-1} \sum_{i \in s_t} x_t(i)\, x_t(i)^{\top}$ is positive definite. Therefore $V_{T-1}^{-1}$ is also positive definite, hence it has a symmetric square root $V_{T-1}^{-1/2}$, and it follows that $X_T^{\top} V_{T-1}^{-1} X_T$ is positive semi-definite. Therefore, the terms starting from the third term in the expansion (2) are all non-negative, because they are products of the eigenvalues of $X_T^{\top} V_{T-1}^{-1} X_T$. Thus we have
\[
\det(V_T) \ge \det(V_{T-1}) \Big( 1 + \sum_{i \in s_T} \lVert x_T(i) \rVert^2_{V_{T-1}^{-1}} \Big).
\]
Applying this recurrence on $V_t$ for $1 \le t \le T$, we can telescope to arrive at the wanted inequality.

Fig. 1 An abstract view of the proposed bandit learning-based online index selection.

Fig. 9 MAB vs. PDTool total end-to-end workload time under TPC-H Skew static workloads with different Zipfian factors (z).

Fig. 10 MAB vs. PDTool total workload time under CH-BenCHmark for static workloads with a range of different TARs.

Fig. 13 MAB vs. PDTool total end-to-end workload time for static TPC-H and TPC-H Skew HTAP workloads.

Fig. 17 MAB vs. PDTool transactional execution time convergence under CH-BenCHmark for dynamic workloads with different transaction levels.

Table 1 Total time breakdown (min); the best choice is in bold.

Table 2 Total end-to-end workload time for static workloads under different database sizes (min).

Table 4 TPC-H Skew benchmark under different round sizes (min).

Table 5 HTAP: total time breakdown (min); the best choice is in bold.