Introduction
Over the past decade, our ability to solve well-studied, computationally challenging problems has increased substantially, as has the importance of high-performance solvers for these problems in the context of real-world applications. This can be seen, for example, in the case of the propositional satisfiability problem (SAT), one of the most prominent NP-complete combinatorial decision problems, which has important real-world applications in hardware and software verification (e.g., [1]).
Some of these advances have been incentivized by benchmarks and competitions, and by the desire to outperform other solvers. Meanwhile, meta-algorithmic techniques, such as automated algorithm configuration (AAC) and automated algorithm selection (AAS), are increasingly being used to advance the state of the art in solving a broad range of problems from AI and related areas. By carefully choosing parameter settings and algorithm components, AAC techniques are often able to achieve substantial performance gains. This has been shown, e.g., for SAT [2], [3], Max-SAT [4], the machine reassignment problem [5], mixed-integer programming (MIP) [2], [6], automated planning [7], and supervised machine learning [8], [9]. AAS techniques leverage the fact that different solvers perform best on different problem instances, and achieve better performance than stand-alone solvers by selecting from a given portfolio of solvers one (or more) solvers that are most suitable for a given instance [10]. AAS has seen successful application to a broad range of widely studied problems, including SAT [11], Max-SAT [12], MIP [13], [14], constraint programming (CP) [15], and AI planning [16].
Due to their substantial impact on performance, meta-algorithmic techniques, such as AAS and AAC, have become increasingly important for benchmarking and competitions. In both settings, traditional ranking and comparison schemes commonly represent the state of the art by relying on metrics that aggregate algorithm performance over a set of problem instances (e.g., average running time), and by considering a single predefined parametrization for each algorithm (e.g., by using the default parameter settings). However, this typically fails to capture the true performance potential. AAC allows solvers to realize their performance potential by finding the best configuration for a specific set or distribution of problem instances. When all solvers are able to reach their full potential, this results in a more accurate representation of the state of the art. Similarly, as observed above in the context of AAS, the best solver is usually not the same for each instance. The true state of the art thus cannot be adequately represented by a single solver, as considered in traditional benchmarks and competitions, but requires the use of multiple solvers or solving techniques. Therefore, to more accurately assess the true state of the art, meta-algorithmic techniques, such as AAS and AAC, have to be utilized. These observations have led to new types of competitions that integrate meta-algorithms, such as the configurable SAT solver challenge (CSSC) [3], and the per-instance selection-based Sparkle Challenges for SAT1 and AI planning.2
However, automatically improving performance by means of meta-algorithmic techniques poses substantial additional challenges. Meta-algorithms are internally complex, and their implementations are usually not easy to use. In addition, there are several well-documented pitfalls that can easily lead to unsatisfactory results. For example, in AAC [17], letting each algorithm under configuration measure its own running time can lead to unreliable results, because different algorithms may measure it in different, inconsistent ways. Furthermore, large-scale performance evaluations and the effective use of meta-algorithmic techniques require substantial computational resources, often benefiting from parallel computation on high-performance computing (HPC) clusters, which introduces additional complexity. Consequently, mistakes do not just lead to poor results, but can also be expensive to rectify and cause additional environmental impact through CO2 emissions.
Having defined the true state of the art in solving a given problem, and established the need for meta-algorithms (such as AAS and AAC) both to assess and to advance it, we further observe that their use in practice (in science as well as industry) is limited (e.g., in research, algorithms are primarily optimized manually or with simple methods, such as grid search [18]). Here, we introduce Sparkle,3 a platform that has been designed to lower the threshold for using these meta-algorithmic techniques effectively and correctly, avoiding common pitfalls and following best practices. Naturally, this should benefit the users of meta-algorithmic techniques, who will have an easier time advancing and staying up to date with the state of the art in solving challenging computational problems. In addition, developers of meta-algorithms may benefit from increased adoption of their methods when these are made available through Sparkle. In turn, attention to improving meta-algorithmic techniques may also grow. In this work, we give a detailed overview of the steps required for applying AAS and AAC, how these are implemented in Sparkle, and which steps are made easier compared to using AAS and AAC in a stand-alone fashion, supported by a small user study. The initial version of Sparkle presented here incorporates one prominent system each for AAS and AAC (AutoFolio [19] and SMAC [20], respectively); both were chosen carefully for their quality and broad usability.
In the following, we first discuss how Sparkle is situated in relation to other work (Section II). Then, in Section III, we cover the core design principles behind Sparkle. Section IV introduces the first major use case, where AAS is employed with parameter-less solvers to take advantage of their complementary strengths, and to accurately assess the true state of the art. Next, Section V considers parameterized solvers, and how they can benefit from AAC in Sparkle in a relatively hassle-free way. In Section VI, we briefly discuss how Sparkle can be used to enable new types of competitions. Section VII covers additional use cases enabled by the Sparkle platform. In Section VIII, we discuss best practices and pitfalls covered by this implementation of Sparkle. Finally, in Section IX, we draw several general conclusions and briefly outline future work.
Related Work
In the following, we consider tools that, like Sparkle, in some way support access to, assessment of, and/or advancement of the state of the art in solving computationally challenging problems. For each of these, we relate their contributions to these goals to those made by Sparkle and highlight the gaps Sparkle aims to fill.
For single- and bi-objective evolutionary algorithms, the COCO framework [21] provides a well-established benchmarking environment, in which many standard tools required for experimentation with and benchmarking of these types of algorithms are readily available. In particular, COCO includes statistical analysis tools and standardized experimentation procedures, which help to avoid common mistakes (e.g., benchmark function variation, through rotations, etc., to avoid overfitting). Additionally, experimental results submitted by users are collected, which makes comparison with other algorithms easy. Support for meta-algorithmic procedures is currently not included.
With ParadisEO [22], evolutionary computation (EC) algorithms can be constructed from components. Through integrations with the algorithm configurator irace [23] and the experimentation platform IOHexperimenter [24], ParadisEO is able to automatically construct EC algorithms and measure their performance [25]. However, an integration like this only makes the configuration procedure available to the specific application and can therefore only be used in its specific context. In this case, ParadisEO is focused on EC algorithms and similar iterative optimization algorithms (as is the IOH framework). With Sparkle, on the other hand, we aim to support the use of AAC and other meta-algorithmic design procedures as generally as possible (and it is already broadly usable, e.g., for SAT solvers, AI planners, and MIP solvers).
SPOT [26] is an R package for hyperparameter optimization (HPO) and provides a number of related tools. Particularly, it focuses on surrogate-model-based parameter optimization to reduce computational cost. In addition to the core optimization tools, SPOT also includes functionality to visually compare parameters obtained as a result of the optimization process. Whereas SPOT focuses on a special case of AAC, Sparkle aims to make a broader range of meta-algorithmic techniques accessible to its users.
Existing algorithm configuration procedures, such as SMAC [20], irace [23], ParamILS [27], and GPS [28], are readily available for use by experts in algorithm configuration, but are often challenging to use for nonexperts. This usability issue stems from their focus on the core configuration process, which is sufficient for experts in algorithm configuration, while only limited support is provided for the overall configuration process that has to be considered in practice (this is elaborated further in Section V, Fig. 2). For algorithm selection, the situation is slightly better. Recognizing that different types of algorithm selectors and settings work best in different use cases, AutoFolio [19] automatically configures an algorithm selector for a given application scenario, using SMAC. Although this simplifies part of the process of constructing a high-performance algorithm selector, AutoFolio itself is not much more accessible to nonexperts than other algorithm selection systems.
Meta-algorithmic benchmarking libraries, such as AClib [29] and ASlib [30], provide scenarios for testing and benchmarking meta-algorithmic techniques. While ASlib limits itself purely to the scenarios to compare on, AClib also includes affordances for running a number of algorithm configurators on those scenarios. Finally, AClib provides tools for producing some basic statistics and plots for the configuration scenarios it has been used to run. Whereas ASlib and AClib are designed for AAS and AAC experts, respectively, Sparkle is designed to be accessible to allow nonexperts to benefit from techniques, such as AAS and AAC. In addition, Sparkle aims to better support users by generating reports that provide more details about the process that was used to obtain a given set of results.
For machine learning pipeline design, various AutoML frameworks exist, such as Auto-sklearn [31] and AutoGluon [32]. While these frameworks are limited to machine learning, this is not the case for Sparkle, which can in principle be applied to a much broader range of computational problems. In addition, Sparkle can also be used for simple performance evaluations and comparative performance analysis, which are usually not included in AutoML frameworks.
For machine learning problems, OpenML [33] collects datasets, experiments, algorithms, and results. This enables scientists to compare their approaches on the same datasets and under the same conditions. In addition, OpenML provides a wealth of data from many different runs of each algorithm on a range of datasets. While this is of great value for benchmarking and accurate performance comparisons, it markedly differs from Sparkle, which supports the broad use of meta-algorithmic techniques.
HAL [34] aimed to support similar functionality to Sparkle, with a particular focus on automated analysis and design of algorithms, including meta-algorithms. Unfortunately, HAL turned out to be over-engineered, in the sense that, while accommodating a wide range of functions at its release, it was difficult to install and use, resulting in limited adoption. Sparkle aims to avoid these issues by ensuring that simple tasks are easy to carry out. To avoid an unnecessarily complex design, it is built around simple, modular command scripts that are easy to use. Behind the scenes, these scripts may still call more complex classes and structures, but those too are designed to have an easily understood interface that is called by the command scripts. To aid installation, Sparkle aims to automate the installation process to the largest extent possible, e.g., by automatically installing dependencies. Furthermore, unlike HAL, Sparkle automatically produces detailed reports.
Design
One of the main design principles underlying Sparkle is that it should be easy to get simple things done (e.g., adding a solver), and as easy as possible to achieve more complex goals (e.g., configuring an algorithm). Naturally, what is easy to do also depends on the level of expertise of different users with regard to the meta-algorithmic procedures made available through Sparkle. Here, we consider an expert user to be someone who is familiar with the specific meta-algorithm they want to use, and a nonexpert someone who is not. We note that, for this initial implementation of Sparkle, we expect nonexperts to be familiar with standard computer science concepts, such as programming and command line interfaces (CLIs). While Sparkle especially targets nonexperts, easier-to-use meta-algorithms should also benefit expert users. These considerations resulted in a design that is primarily based on a relatively small set of commands that are largely self-explanatory. By combining these commands, scripts can be written to specify and run experiments. For expert users, this should result in scripts that broadly follow the high-level processes they are already familiar with, but with a much reduced need to specify details, while nonexpert users will benefit from the scaffolding and abstraction afforded by this approach. For example, in algorithm selection, the instances, solvers, and feature extractor(s) are added to the system, after which the features and performance data are computed, and the portfolio selector and the report are constructed. This reflects the usual process for constructing a portfolio selector and makes clear what needs to be provided by the user, without requiring detailed instructions from the user on how to actually construct the selector, or on how to evaluate it.
To achieve a system that is as easy and safe to use as possible, we broadly consider the following three categories of usability throughout Sparkle.
Efficiency.
Correctness.
Understandability.
Efficiency primarily concerns how simple it is for users to achieve their goals with Sparkle. This includes the commands they need to call (e.g., to add a solver or to configure an algorithm).
Correctness focuses on the correct execution of all processes initiated by the user. Here, the focus is on the internal operation of Sparkle. Since complex meta-algorithmic techniques are provided to potentially nonexpert users, support is needed in their correct use. To the largest extent possible, Sparkle aims to make sure that correct experimental procedures are followed (e.g., by correctly implementing the standard protocol for algorithm configuration from [35], also discussed later in Section V), and that potential pitfalls are avoided (see Section VIII). Since it is not always possible to guarantee correct use, Sparkle aims to provide adequate warnings and possible solutions when potential problems are detected that cannot be prevented by design (e.g., crashes of the algorithm that is being configured). As a result, at a minimum, the user should be made aware of potential problems (e.g., by writing warnings and errors to the command line) and should then be able to take action or seek help from experts.
Understandability is concerned with explaining the output and operation. The main vehicle for this is the reporting functionality of Sparkle. For each meta-algorithmic process (e.g., algorithm configuration), a report can be generated. This report then describes the experimental setup (e.g., which problem instances were used), the process that was followed (e.g., how performance is assessed), and the results, along with pertinent references to the literature. To show how Sparkle helps compared to using AAS and AAC in a stand-alone fashion, example reports for AAS and AAC are included in the supplementary material, together with raw output of AutoFolio and SMAC (the state-of-the-art AAS and AAC systems currently integrated into Sparkle). Concretely, compared to AutoFolio, the most significant additions of the Sparkle report for AAS are: a detailed description of the algorithm selection procedure and settings, performance comparisons between individual solvers and the algorithm selector, insight into individual contributions of component solvers to the performance of the algorithm selector, and plots for visual comparison. Similarly, compared to SMAC, the main additions of the Sparkle report for AAC are: a detailed description of the algorithm configuration procedure and settings, plots for visual comparison, and a comparison of the number of instances on which the target algorithm timed out with default and configured parameters. Beyond the reports, other ways to support understanding include analysis tools, for instance to assess parameter importance (see Section VII-B).
Sparkle for Parameter-Less Solvers
With the design principles from the previous section in mind, we first consider Sparkle for the simplified case in which the performance parameters (which only affect the performance of a given solver) of all solvers are fixed at a predetermined setting, or no such parameters exist. Under these conditions, AAC is not applicable, but performance complementarity between solvers can be exploited using AAS. The per-instance algorithm selection problem arises in situations where no single algorithm is the best for every problem instance of interest, and thus performance can be improved by selecting the best performing algorithm on a per-instance basis [10]. AAS aims to tackle the per-instance algorithm selection problem in an automated fashion [36], [37], [38]. The key idea is to construct a per-instance algorithm selector that predicts the most suitable algorithm for each given problem instance, based on reasonably efficiently computable features of that instance. Fig. 1 illustrates a typical process followed in the construction of an algorithm selector and indicates where the version of Sparkle described in the following improves ease of use compared to using a selector construction tool on its own, beyond the affordances such tools commonly provide.
Typical algorithm selector construction process, with steps where Sparkle improves the ease of use compared to using a stand-alone selector construction system shown in boldface. Solid lines connect related steps; dashed lines separate the phases: preparation, execution, and analysis.
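More formally (a compact sketch of the standard formulation, where the feature map $f$ and the performance metric $m$ are notation introduced here for illustration), given a set of solvers $S$, a set of instances $I$, a feature map $f$ that assigns to each instance $i \in I$ a feature vector $f(i)$, and a performance metric $m(s, i)$ (e.g., the penalized running time of solver $s$ on instance $i$), a per-instance selector is a mapping $R$ from feature vectors to solvers, ideally chosen such that
$$\sum_{i \in I} m\big(R(f(i)), i\big)$$
is minimized; the construction procedure discussed below approximates such a mapping from the available feature and performance data.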
At the conceptual level, the Sparkle platform for parameter-less solvers then comprises the following core components.
A collection $S$ of solvers.
A collection $I$ of problem instances used for training.
A collection $E$ of feature extractors to compute problem instance features.
Feature data $F$ computed using the feature extractors in $E$ for the instances in $I$.
Performance data $P$ (for now, only running times are implemented, but the main principles are the same for solution quality optimization) for the solvers in $S$ on the instances in $I$.
A procedure $C$ for constructing a portfolio-based selector $R$ based on $S$, optimized for performance on $I$, using feature data from $F$ and performance data from $P$.
A procedure $N$ for computing the contributions of any solver $s \in S$ to a given portfolio-based selector $R$ on the instances in $I$.
In addition, we need to provide the following support components.
A mechanism for performing runs of solvers from $S$, of feature extractors from $E$, of the portfolio-based selector $R$, of the construction procedure $C$, and of the contribution analysis procedure $N$.
A user interface (UI) that makes it easy to add solvers to $S$, to remove solvers from $S$, to submit an instance $i$ to be solved, and to access performance and contribution data for solvers in $S$.
Above, in Listing 1, we show a concrete example of how these components are realized and used in the form of concrete Sparkle commands for constructing a per-instance algorithm selector based on three SAT solvers.4
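For concreteness, a command sequence along these lines might look as follows; the line numbers correspond to those referenced in the walkthrough below, while the command names (apart from those mentioned elsewhere in this article, such as add_instances.py and run_sparkle_portfolio_selector.py), solver names, and directory paths are assumptions and may differ from the actual platform.

    1   initialise.py
    2   add_instances.py Instances/SAT-train/
    3   add_solver.py Solvers/SolverA/
    4   add_solver.py Solvers/SolverB/
    5   add_solver.py Solvers/SolverC/
    6   add_feature_extractor.py Extractors/SAT-features/
    7   compute_features.py --parallel
    8   run_solvers.py --parallel
    9   construct_sparkle_portfolio_selector.py
    10  generate_report.py

Here, SolverA, SolverB, and SolverC stand in for the three SAT solvers, and the --parallel option (mentioned below) is shown for the two computation steps.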
As seen in this example, in practice, constructing a per-instance algorithm selector in Sparkle works as follows.
After a general initialization in line 1, a user seeds $I$ with a collection of problem instances5 suitable for training selectors in line 2 (e.g., when dealing with SAT solvers, these could be taken from past SAT competitions6 [39], [40], [41], [42], [43], [44], [45], [46]), adds one or more solvers to $S$ (lines 3–5), and adds one or more feature extractors to the collection $E$ (line 6). The user then triggers the computation (optionally with --parallel) of the feature data $F$ (line 7) and the performance data $P$ (line 8), by using a Sparkle command to run $S$ and $E$ on all instances in $I$ with a fixed cutoff time $t_{\mathrm{max}}$ (specified through a settings file). Finally, the user runs the construction procedure $C$ to obtain an initial portfolio-based selector $R$ (line 9). At any point following this, the user can generate a report for the current portfolio-based selector (line 10).
When a new instance $i$ (or a set of problem instances) is to be solved, it can be passed to the run_sparkle_portfolio_selector.py command. Following this, its features are computed, using the feature extractors in $E$; the resulting feature vector is then passed to the portfolio-based selector $R$, which runs one or more solvers from $S$ on the new instance $i$. When multiple solvers are chosen by the selector, these are executed according to a fixed schedule, based on the output of the selector.
When a new solver $s$ is added by the user, $s$ is added to $S$ and run on all instances in $I$ (this can be streamlined by using the --run-solver-now option with the add_instances.py command) with cutoff time $t_{\mathrm{max}}$; the resulting performance data are added to $P$. We note that $s$ is run once for each instance, even for nondeterministic solvers. This approach has been taken in order to avoid the overhead of performing multiple runs per instance, and to match common practices in current competitions. After this, $C$ is run to obtain a new portfolio-based selector $R$ that may utilize $s$.
Similarly, when a user removes a solver $s$, $s$ is removed from $S$ and the performance data for $s$ is removed from $P$. Then, $C$ is run to obtain a new portfolio-based selector $R$ that no longer uses $s$.
A report generated for a portfolio-based selector7 contains detailed information (including the collection of instances $I$, solvers $S$, and feature extractors $E$, the procedures used for constructing portfolio-based selectors, the detailed experimental setup, etc.). Further, the generated report contains the comparison of the constructed portfolio-based selector to both the single best solver and the virtual best solver (VBS).8
In this initial implementation of Sparkle, per-instance algorithm selectors are constructed using the freely available, state-of-the-art AutoFolio system [19], which uses an algorithm configuration procedure to automatically construct high-performance algorithm selectors. Evidently, other per-instance algorithm selectors (e.g., [11], [2], [5], and [47]) and selector construction methods (e.g., [48] and [49]), as well as algorithm scheduling and parallel portfolio construction methods (e.g., [50]) can in principle be used instead of AutoFolio to exploit performance complementarity between parameter-less solvers. The advantages of AutoFolio are that: 1) it is a general-purpose method applicable to algorithms for arbitrary problems; 2) it incorporates a number of per-instance algorithm selection techniques known from the literature; and 3) it automatically configures an algorithm selector based on the resulting collection of algorithm selection procedures and their hyperparameters, to maximize performance in a given situation, as characterized by a given set of solvers, feature extractors, and training instances. The current implementation using AutoFolio clearly demonstrates the potential of the Sparkle system.
Performance can be assessed and optimized using various metrics, such as the number of solved instances [51], as well as PAR10, the penalized average running time with a penalty factor of 10, which averages running time over a given set of instances, counting each timed-out run as ten times the given cutoff time [27]. Currently, for algorithm selection, Sparkle is restricted to the widely used PAR10 metric.
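In terms of a formula, for a solver $s$, an instance set $I$, and a cutoff time $t_{\mathrm{max}}$, the PAR10 score described above can be written as
$$\mathrm{PAR}_{10}(s, I) = \frac{1}{|I|} \sum_{i \in I} \begin{cases} t_s(i) & \text{if } s \text{ solves } i \text{ within } t_{\mathrm{max}},\\ 10 \cdot t_{\mathrm{max}} & \text{otherwise,} \end{cases}$$
where $t_s(i)$ denotes the running time of $s$ on instance $i$.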
In the current implementation of Sparkle, solver contributions (to the PAR10 performance of the constructed portfolio-based selector) are quantified in terms of their marginal contribution.
To accurately compute the marginal contribution of a given solver $s$, the portfolio-based selector is reconstructed from the remaining solvers in $S \setminus \{s\}$, and its performance is compared to that of the selector constructed from all solvers in $S$.
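Expressed as a formula (one plausible formalization, consistent with the description above; the exact definition used by Sparkle may differ, e.g., by normalizing with respect to the VBS), the marginal contribution of solver $s$ could be written as
$$\mathrm{MC}(s) = \mathrm{PAR}_{10}\big(R_{S \setminus \{s\}}, I\big) - \mathrm{PAR}_{10}\big(R_{S}, I\big),$$
where $R_{S'}$ denotes the portfolio-based selector constructed from the solver set $S'$; a larger value indicates that removing $s$ degrades the selector more, i.e., that $s$ contributes more to the state of the art.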
When the marginal contribution is used to evaluate each given solver, there is a clear incentive for solver developers to focus their efforts on improving the state of the art, as represented by a high-quality portfolio-based algorithm selector, such as AutoFolio [19], *Zilla [49], or ASAP [54]. The Sparkle platform operationalizes this incentive by constructing the per-instance algorithm selector, tracking solver contributions, and making detailed information on the performance and contribution of their solvers available to solver developers; it also provides a fair and well-defined way to assess solver contributions. We note that, although the marginal contribution is useful, there are other metrics one may want to consider. Selectors can also make suboptimal choices, and although AutoFolio (and thus Sparkle) specifically aims to mitigate this issue by suggesting algorithm schedules (to hopefully avoid the worst case, in which a single poorly chosen solver does not solve the problem), other and better solutions could be developed to address this general issue in AAS.
Considering our implementation in Sparkle, and the step-by-step process outlined in Fig. 1, we observe several improvements to the ease of use. In step 1, Sparkle takes care of most of the work the user would otherwise have to do to collect the performance data. Instead of writing their own scripts to run all combinations of instances and algorithms, ensuring correct and consistent cutoff time measurements between all algorithms (step 1b), and formatting the performance data (step 1c), with Sparkle, the user only has to adapt a small part of a wrapper template (for each algorithm) to call the algorithm executable and print the performance. Sparkle provides similar savings in time and effort for feature data preparation (step 3). In step 4b, Sparkle provides a minor convenience by automating the installation of AutoFolio, which nevertheless saves some time and effort. Importantly, Sparkle ensures the process is performed correctly, by always running validation on the training set in step 6b. With AutoFolio, this is not guaranteed, since it runs validation only when the selector is not saved to a file. In step 7, Sparkle helps the user beyond the basic performance indicators returned by AutoFolio, by generating a report. This report includes (in addition to the basic performance indicators AutoFolio also returns for the produced selector) the individual performance per solver and a visual comparison of the selector to the single best solver (step 7a), the contribution to the selector of each solver in the form of the marginal contribution (step 7b), and the experimental procedure that was followed, with references (step 7c). The automatic generation of the report not only saves time compared to using custom-built scripts or partially manual processes, but also helps to avoid errors.
To confirm the benefits of Sparkle for AAS, a small study was performed with two users. They were asked how much time and code they needed for each step from Fig. 1. One user (using AutoFolio directly and through Sparkle) reduced the time spent by 68 % when using Sparkle compared to using AutoFolio without Sparkle, and reported a similar reduction (69 %) in terms of the lines of code they needed to write. For the second user (having only tried AutoFolio9), the results were strongly influenced by steps 1a and 2a, which together took up more than 95 % of the time, as well as a large portion of the code. This is the result of the user having to spend significant time and coding effort to get algorithms from external sources installed and running on a specific HPC environment (step 1a), and having to spend significant time collecting problem instances (step 2a). Taking this into account, the time spent could have been reduced by 1 % when using Sparkle and the code written by 5 %. When we exclude these two steps (which are identical with and without Sparkle) and look only at the remaining steps, however, the savings were quite a bit more significant, with the user being able to save 22 % of their time and 16 % of the coding effort. It is also worth mentioning that this second user did not compute the marginal contributions of the component algorithms to the produced algorithm selector, something Sparkle would have done for them. For both users, Sparkle helped to reduce the time and effort spent for making effective use of AAS. This suggests that Sparkle can indeed make the use of AAS easier. Full results and an explanation of the data processing are included in the supplemental material.
Sparkle for Parameterized Solvers
The version of Sparkle considered in the previous section exploited the fact that there are effective algorithm selection methods that can be used to leverage the complementary strengths of nondominated solvers. In practice, many of these solvers are (or easily can be) parameterized, such that they can be configured for optimized performance on different types of instances. For example, in the case of SAT solving, the degree to which the performance of state-of-the-art solvers can be optimized for specific types of instances has been demonstrated in the CSSCs [3]. This motivates the use of automated configuration procedures, such as SMAC [20], irace [23], ParamILS [27], or GPS [28], to obtain state-of-the-art solvers for specific types of instances. Fig. 2 describes a typical process followed in algorithm configuration and indicates where the version of Sparkle described in the following improves ease of use compared to using a configurator on its own, beyond the affordances such tools commonly provide. To effectively exploit the configurability of solvers, we now describe an extension of the basic version of Sparkle discussed in Section IV.
Typical algorithm configuration process, with steps where Sparkle improves the ease of use compared to using a stand-alone configurator shown in boldface. Further, Sparkle also provides useful support for steps that are already commonly supported by AAC tools, e.g., by providing a broadly usable and carefully constructed wrapper template for target algorithms. Solid lines connect related steps; dashed lines separate the phases: preparation, execution, and analysis.
In addition to the components introduced in the previous section, this extended version of Sparkle also comprises the following.
Configuration spaces for some (or all) of the solvers in $S$, where a configuration space $\Theta$ consists of a list of parameters, a domain of possible values for each such parameter, a (possibly empty) set of conditional parameter dependencies, and a (possibly empty) set of constraints on combinations of parameter values (see the example sketched after this list).
An AAC procedure $O$ for optimizing the performance of a given solver $s \in S$ on a subset of $I$.
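To make this concrete, the following is a small, hypothetical configuration space written in a PCS-style syntax, as used by configurators such as SMAC; the exact syntax expected by Sparkle, and the parameter names shown, are assumptions rather than taken from this article.

    # hypothetical configuration space for a SAT-style solver
    restart-strategy  categorical {luby, geometric, none} [luby]
    preprocessor      categorical {on, off} [on]
    var-decay         real [0.80, 0.999] [0.95]
    restart-base      integer [10, 10000] [100] log
    # conditional dependency: restart-base is only active when restarts are used
    restart-base | restart-strategy in {luby, geometric}
    # forbidden combination of parameter values
    {restart-strategy=none, preprocessor=off}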
In a practical setting, we also need to provide the following.
A mechanism for performing runs of the automatic configuration procedure $O$;
UI affordances to specify configuration spaces for solvers in $S$; to specify which subset of instances from $I$ is to be used for a run of the configuration procedure $O$ on a specific solver $s \in S$; to launch such a run; and to validate the performance of the configurations of $s$ thus obtained on another subset of instances from $I$.
The implementation in Sparkle is illustrated in Listing 2.10 Following initialization (line 1), two instance sets from the same distribution are added to $I$: one for use during configuration (training) and one for testing the resulting configuration on unseen instances.
Example of algorithm configuration in Sparkle for the capacitated vehicle routing problem.
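For concreteness, a configuration command sequence of this kind might look as follows; the command names, options, and paths are assumptions based on the platform's naming convention and may differ from the actual Sparkle commands.

    1   initialise.py
    2   add_instances.py Instances/CVRP-train/
    3   add_instances.py Instances/CVRP-test/
    4   add_solver.py Solvers/CVRP-solver/
    5   configure_solver.py --solver Solvers/CVRP-solver/ --instance-set-train Instances/CVRP-train/
    6   validate_configured_vs_default.py --solver Solvers/CVRP-solver/ --instance-set-train Instances/CVRP-train/ --instance-set-test Instances/CVRP-test/
    7   generate_report.py

Here, Solvers/CVRP-solver/ is a placeholder for a parameterized vehicle routing solver together with its configuration space.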
Currently, SMAC [20], which is a widely used state-of-the-art configurator that is also freely available for academic use, is the configurator $O$ used in Sparkle.
The current performance metrics for running time (PAR10), as described in Section IV, are also used for algorithm configuration.
In this version of Sparkle, the configuration procedure $O$ is run multiple times; the best configuration found in each run is then validated on the training instances, and the best-performing of these is selected, following the standard protocol for algorithm configuration [35].
When comparing our implementation in Sparkle to the step-by-step process for AAC shown in Fig. 2 (and specifically, SMAC), we observe several usability improvements. In step 4b, Sparkle automates the creation of files with paths to each individual training and testing instance. While users can do this fairly easily with their own scripts, it still saves some time. Installation of the AAC procedure is automated by Sparkle (step 5b). In step 7, Sparkle ensures a number of important aspects of AAC happen correctly (i.e., following the standard protocol for AAC [35]), and also saves time by automating these steps. Specifically, step 7b ensures multiple configuration runs are done, step 7c provides a minor aid by automating the creation of a configuration scenario file for SMAC, and steps 7d–7f automate the correct handling of validation for configuration results and selecting the final configuration. Finally, in step 8, Sparkle provides major support by providing much more detailed result analysis than SMAC does by default. Beyond the performance of the configured algorithm and the parameter string associated with it given by SMAC, Sparkle additionally provides the performance of the default configuration, plots comparing the performance of the configured and default algorithm parameters per instance, the number of timeouts of those two configurations, and a description of the followed experimental procedure. The automated collection of all results saves time, and gives the user substantially more insight. It also helps prevent mistakes.
For AAC, we also performed a small user study with four users, who were asked to estimate how much time and code they needed for each step in Fig. 2. One user (having tried AAC with SMAC13) would save 35 % of their time by using SMAC through Sparkle, while their code use would be almost equal (only 1 % less). Interestingly, the user did not follow many of the substeps of step 7 (see Fig. 2), which relate to the standard protocol for AAC [35], and thus little or no time was spent on these substeps. Had they followed the standard protocol, time and code savings with Sparkle likely would have been substantially larger. This also supports the need for tools like Sparkle that, by design, ensure the correct and efficient use of AAC. The second user (also using AAC with SMAC13) could have reduced time spent by 51 % and lines of code written by 43 % by using Sparkle. For the third user (using SMAC through Sparkle), the data they entered shows that they spent almost no time and had to write no code at all for steps 7 and 8, suggesting that the major support Sparkle provides for these steps is helpful. Since we cannot estimate how much time and code they would have needed for those steps when using SMAC without Sparkle, it is not clear how much time and coding effort they saved by using Sparkle, but it is reasonable to assume the savings were substantial. The fourth user (using SMAC13 without Sparkle) could have reduced the time they spent by 40 % with Sparkle, and the lines of code written by 29 %. The interested reader can find the instructions for users, complete results, and analysis procedure in the supplemental material.
Sparkle as a Competition Platform
By utilizing the Sparkle platform, we can organize Sparkle challenges, a novel type of competitive event, which aims to advance the state of the art in solving various challenging computational problems, including Boolean satisfiability (SAT) [46], AI planning [57], and other problems, by leveraging automatically constructed algorithm selectors and by quantifying contributions of individual solvers.
As mentioned previously, it is well established that the state of the art for solving challenging computational problems (e.g., SAT [58], planning [54], minimum vertex cover [59], answer set programming [48], satisfiability modulo theories [60], etc.) is not defined by a single solver, but rather by a collection of nondominated solvers with complementary strengths. To exploit this performance complementarity, machine learning techniques can be leveraged to build effective automatic algorithm selectors that utilize state-of-the-art solvers. Sparkle challenges automatically combine all participating solvers into a state-of-the-art algorithm selector, and assess the contribution of each participating solver to the performance of that algorithm selector, using the functionality introduced in Section IV. Participants are incentivized to advance the state of the art as measured by this selector, by maximizing the contribution of their solver to the overall selector performance.
At the moment, traditional solver competitions (such as the international SAT competitions6 [44], [45], [46], international planning competitions14 [57], [61], [62], etc.) measure the performance of each individual solver across a large set of benchmark instances, and identify the winning solver(s) based on their overall performance across this instance set.
Rather than the gold, silver, and bronze medals awarded in traditional solver competitions, participants in Sparkle challenges are awarded slices of a single gold medal; the size of each slice is proportional to the magnitude of the marginal contribution made by the respective solver to the performance of the automatically constructed selector built from all participating solvers on all benchmarking instances. That is to say, Sparkle challenges identify the best solver per instance, and award solvers based on the number of instances for which they contribute the best performance.
In recent years, we have already organized two Sparkle challenges with earlier unpublished versions of the Sparkle platform described here. The Sparkle SAT Challenge 20181 was an official competition affiliated with the 21st International Conference on Theory and Applications of Satisfiability Testing,15 while the Sparkle Planning Challenge 20192 was a satellite competitive event affiliated with the 29th International Conference on Automated Planning and Scheduling.16
Both events attracted significant participation and produced a series of interesting results, thus demonstrating the viability of using Sparkle for new competition formats. (A presentation and discussion of the results of the Sparkle challenges are beyond the scope of this article, but the interested reader can find further information on the web pages referenced earlier.)
Other Uses and Extensions
In Sections IV and V, we have outlined the two primary use cases supported in the current version of Sparkle: algorithm selection and algorithm configuration. Beyond those, a number of other use cases and extensions should be highlighted.
A. Benchmarking
Many tools in Sparkle can also be used for traditional benchmarking tasks, in addition to their use as part of the selection or configuration processes. In particular, the parallel processing functionalities available in Sparkle turn the parallel execution of a collection of algorithms or computation of the features for a set of instances into simple tasks, especially compared to writing scripts for each new case.
One specific benchmarking use-case enabled by Sparkle is the comparison of multiple solvers by using the AAS functionalities. This facilitates analysis for a specific experiment or application, as opposed to competitions, which often consider larger sets of solvers and problem instances. The report resulting from AAS provides, for each solver, the respective PAR score and marginal contribution to the portfolio selector. This gives an indication of the value of each solver. We emphasize again that analyzing solvers based on their marginal contributions better represents the state of the art than traditional ranking schemes aimed at pinpointing a single best solver. More details on which solver worked well on which specific instances are stored in a file containing the results per problem instance, but currently have to be compared manually.
B. Parameter Importance Analysis
Following algorithm configuration, solver developers can gain valuable insights by analyzing the resulting configuration in more detail. Parameter importance analysis gives insights into how much individual parameters contribute to the performance difference between two parameter configurations of a solver $s$.
Currently, Sparkle uses ablation analysis [63] to assess parameter importance. Ablation analysis constructs a path from the default configuration to a given target configuration (e.g., one obtained through AAC) by iteratively changing, at each step, the parameter whose modification yields the largest performance improvement; the order in which parameters are changed along this path, and the associated performance gains, indicate which parameters contribute most to the observed performance difference.
Ablation analysis in Sparkle requires a solver $s$, the two configurations to be compared (typically the default configuration and one obtained through AAC), and a set of instances on which the intermediate configurations along the ablation path are evaluated.
C. Parallel Algorithm Portfolios
Similarly to AAS, parallel algorithm portfolios (PAPs) [50], [64], [65] can help to obtain greater performance out of a given (set of) algorithm(s). Where AAS leverages the performance variation between different algorithms, PAPs can also utilize the variation between multiple runs of a single, randomized algorithm. This is done by executing a portfolio of algorithms in parallel. When considering running time optimization, the performance variation between different algorithms will result in each finding solutions faster for different instances.
Here, we consider basic PAPs that do not take advantage of additional information. As such, it is not necessary to predict which algorithm is fastest; instead, all algorithms in the portfolio are always included, so that, in theory, the portfolio always achieves the running time of its fastest member, trading parallel computational resources for reduced wall-clock time. In practice, parallelization creates some overhead, and more sophisticated PAPs can run only the solvers predicted to be fast, in order to reduce this overhead.
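Ignoring parallelization overhead, the idealized running time of a PAP $P \subseteq S$ on an instance $i$ is simply that of its fastest component,
$$t_P(i) = \min_{s \in P} t_s(i),$$
which is what makes PAPs attractive when per-instance performance varies strongly between solvers, or between runs of a randomized solver.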
This first version of PAPs has been implemented in Sparkle, and broadly follows the same command flow as AAS and AAC. A selection of algorithms can be combined into a PAP, and for each nondeterministic algorithm, the desired number of copies can be indicated. Once a portfolio has been created, it can be used on a subset of the instances in $I$.
Best Practices and Pitfall Avoidance
It is well known that pitfalls commonly arise when using AAC techniques, and that certain practices are helpful in avoiding these and ensuring successful applications [17]; similar considerations apply to the effective use of AAS and other meta-algorithmic techniques (e.g., in the context of CP [66]). Sparkle aims to incorporate best practices to make correct and effective use of these techniques.
We do not exhaustively discuss all pitfalls here, since some of them do not apply (e.g., each configurator handling the algorithms under configuration differently) and others are beyond the scope of the current implementation of Sparkle (e.g., assuring generalizability across machines with different hardware). Naturally, as Sparkle is further extended, additional best practices will be incorporated.
Correct interaction between the solvers included by the user, the configurator, and Sparkle itself, is supported by providing wrapper templates to the user for both AAS and AAC.
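As a rough illustration (the interface Sparkle expects from a wrapper is not specified in this article; the argument order, exit-code convention, and output format below are assumptions), such a wrapper might look as follows for a SAT solver:

    #!/bin/bash
    # Hypothetical wrapper sketch: adapt the marked line for a concrete solver.
    instance="$1"   # path to the problem instance
    seed="$2"       # random seed supplied by the platform/configurator
    shift 2         # any remaining arguments are parameter settings

    # >>> adapt this line to call the actual solver executable <<<
    ./my_solver --seed "$seed" "$@" "$instance"
    status=$?

    # Map the SAT-competition exit-code convention (10 = SAT, 20 = UNSAT) to a status string.
    case "$status" in
      10) echo "SAT" ;;
      20) echo "UNSAT" ;;
      *)  echo "UNKNOWN" ;;
    esac
    # Running times and cutoffs are handled externally (e.g., via runsolver, see below),
    # so the wrapper itself does not need to time or terminate the solver.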
As mentioned previously in the discussion of Sparkle for parameterized solvers (Section V), the widely used standard protocol for algorithm configuration [35] is integrated into Sparkle, to facilitate its correct and efficient use. Particularly, Sparkle performs multiple runs of the configurator, validates the best configuration from each run on the training instances, and selects the best performing from those.
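Stated compactly, if $\theta^{*}_{1}, \ldots, \theta^{*}_{k}$ denote the best configurations returned by $k$ independent configurator runs, the configuration that is finally reported is (using the PAR10 metric on the training instances $I_{\mathrm{train}}$)
$$\hat{\theta} = \arg\min_{\theta \in \{\theta^{*}_{1}, \ldots, \theta^{*}_{k}\}} \mathrm{PAR}_{10}\big(s(\theta), I_{\mathrm{train}}\big),$$
where $s(\theta)$ denotes the target solver run with configuration $\theta$.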
To ensure the correct handling of the running time cutoffs for target algorithms, Sparkle uses runsolver, a tool that is widely used in benchmarking studies and competitions [67], [68]. The main goal in this context is to measure running times in a trusted and consistent way, rather than relying on target algorithms that may each measure and report their running time differently and possibly incorrectly. In some cases, runsolver is called from the wrapper the user provides for their solver; while the provided wrapper templates already include this call, correct use still partly relies on the user. Like the enforcement of running time cutoffs, the actual measurement of the running time of target algorithms is also done through runsolver, as is time measurement in other situations, such as feature extraction.
With regard to the termination of target algorithm runs, Sparkle also relies on runsolver: it monitors the running time and terminates the run when the cutoff time is reached. As with time measurement, the termination procedure also partially relies on a wrapper that is adapted by the user for their target algorithm, but otherwise Sparkle handles everything.
Correct use is also supported by some smaller adjustments. When configuring nondeterministic algorithms, Sparkle uses multiple random seeds per problem instance, to avoid over-tuning to specific seeds. To deal with unexpected output, some checks are included, e.g., to try to detect target algorithm crashes. To deal with incorrect results returned by target algorithms, solution checkers should be used whenever possible, and such a checker is included for SAT. Memory limits for algorithm runs are handled through the Slurm workload manager,17 by including them in run and batch calls to Slurm.
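For example (the script name is a placeholder; --mem-per-cpu and --time are standard Slurm options), a batch submission with a memory limit might look like:

    sbatch --mem-per-cpu=3000 --time=01:00:00 run_solvers_batch.sh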
Conclusion
To conclude, we summarize the main takeaways of our work, before briefly discussing directions for future work.
A. Summary
High-performance solvers have been key to improving our ability to solve computationally challenging problems, such as SAT, MIP, and AI planning, in academic as well as real-world settings. Improvements to these solvers are in part incentivized by benchmarks and competitions, where participants drive each other to continuously advance the state of the art. At the same time, by utilizing meta-algorithmic techniques, such as AAC and AAS, performance can be improved by making maximal use of the potential available from existing algorithms and algorithm components.
Assessing the performance of such improvements is still commonly done by measuring which algorithm is the best overall. However, this is not an accurate representation of the state of the art. The true state of the art is represented by a collection of complementary solvers that perform well on different subsets of problem instances (as also leveraged by AAS). This is complemented by techniques, such as AAC and PAPs, which get us closer to the maximal performance potential of algorithms or portfolios of algorithms, and as such also support a more accurate picture of the state of the art. Assessing the true state of the art thus requires complex, specialized techniques that are not easy to use correctly, and the same applies to using meta-algorithmic techniques to benefit from their ability to maximize performance.
To this end, we introduced the Sparkle platform. With this platform, meta-algorithmic tools are made accessible to solver developers and users that may not have much expertise in meta-algorithmics. The key principles behind making these tools accessible are that they can be used effectively and correctly, even by nonexperts. To achieve this, standard protocols are implemented to automatically ensure correct use and pitfall avoidance where possible. Otherwise, checking mechanisms are used to alert the user to potential problems, and support messages are provided to guide them in resolving potential issues. All of this drives the increased adoption and application of meta-algorithmic techniques, which in turn improves performance and results in a more accurate view of the state of the art.
Specifically, in this work we have shown how AAS and AAC are implemented in and accessible through Sparkle. By implementing standard protocols, the effort and expertise required from users to adopt meta-algorithms are decreased compared to stand-alone AAS and AAC tools. In addition, we have outlined the use of Sparkle as a competition platform. By employing the meta-algorithms available in Sparkle for competitions, the outcome of the competitions more accurately represents the true state of the art. With AAS, solvers can be credited based on their contribution, which recognizes that different solvers are the best for different instances, and that there is no single best solver. Finally, extensions for parameter importance analysis and the use of PAPs have been discussed, to showcase how Sparkle can further support users in analyzing results and in gaining performance from other meta-algorithmic techniques.
B. Future Work
To reach the overarching goal of the Sparkle platform to cover the full range of meta-algorithmic techniques that are of broad interest, several extensions are desirable.
The Sparkle team aims to ensure support for an up-to-date collection of AAS and AAC techniques, and mechanisms for others to add such techniques. This will keep Sparkle aligned with advancements in AAS and AAC, and also facilitate extensions to benchmarking and comparison of these techniques. By including new, improved techniques in Sparkle, users can easily adopt them, since the interface stays the same.
To further maximize the impact of solvers, integrations of selection and configuration procedures will be provided in Sparkle. One way is to configure a solver that optimizes its contribution to a portfolio-based selector, similar to Hydra [13], [69]. Sparkle could be extended to obtain new configurations for a set of separate and possibly very different solvers, rather than for a single parameterized solver, as in Hydra. With this extension in place, interesting combinations of AAC and AAS could be used in Sparkle challenges.
While multiobjective meta-algorithms are still scarce, some do exist, such as the multiobjective algorithm configurator MO-ParamILS [70]. Sparkle should be extended to make such techniques available to a broader audience, thus also supporting the many problem domains with conflicting objectives.
For AAS as well as AAC, several useful extensions to Sparkle are possible. For AAS, this includes support for performance metrics other than PAR10, such as solution quality.
Once the functionality currently provided by Sparkle is in broad use, the state of the art in solving a diverse range of computational problems should, and can easily, be assessed more accurately, by means of solver configurability and complementarity. Improvements to the state of the art can then also be better incentivized by new types of competitions, such as the CSSCs [3] and Sparkle challenges.1,2 Beyond providing the necessary tools, future versions of Sparkle should therefore simplify setting up such competitions, e.g., by reducing the manual work for organizers.
In parallel to the possible extensions mentioned above, Sparkle should also be further improved in how it supports users. For example, a testing harness could be supplied that audits the compliance of a solver with the requirements of Sparkle’s configuration mechanism, and installation of both Sparkle itself and target algorithms could be simplified with support for container technologies (e.g., Docker18 or Kubernetes19). On the practical side, the current version of Sparkle only runs with the widely used Slurm workload manager.17 However, everything could in principle run on a single machine, a cluster, or in the cloud, and we aim to accommodate all of these scenarios in the future. Finally, by further reducing the prerequisite knowledge required to use Sparkle, meta-algorithmic techniques should become even more accessible to a progressively wider audience.
Once completed, these extensions will bring Sparkle close to the vision of making the full range of meta-algorithmic techniques that are of broad interest accessible and usable for a wide audience. The resulting increase in the adoption of these techniques should help researchers and practitioners to realize the full potential of their algorithms, and to benefit from each other's work, thus maximally advancing the true state of the art in solving challenging computational problems.
ACKNOWLEDGMENT
The authors would like to thank Richard Middelkoop for work on the implementation of PAPs, and all users we have interviewed for taking the time to provide input on the amount of code and time they needed per step for AAS or AAC. Some of the ideas discussed in this document have their roots in joint work and discussions with Frank Hutter, Chris Fawcett, and Kevin Leyton-Brown.