Scalable Gaussian Processes on Discrete Domains

Kernel methods on discrete domains have shown great promise for many challenging data types, for instance, biological sequence data and molecular structure data. Scalable kernel methods like Support Vector Machines may offer good predictive performances but do not intrinsically provide uncertainty estimates. In contrast, probabilistic kernel methods like Gaussian Processes offer uncertainty estimates in addition to good predictive performance but fall short in terms of scalability. While the scalability of Gaussian processes can be improved using sparse inducing point approximations, the selection of these inducing points remains challenging. We explore different techniques for selecting inducing points on discrete domains, including greedy selection, determinantal point processes, and simulated annealing. We find that simulated annealing, which can select inducing points that are not in the training set, can perform competitively with support vector machines and full Gaussian processes on synthetic data, as well as on challenging real-world DNA sequence data.


I. INTRODUCTION
U NCERTAINTY quantification is an increasingly important feature of machine learning models. This is particularly crucial in applications such as in biomedicine [1]- [3], where prediction errors may have serious repercussions. Consider a wet lab biologist seeking to find a DNA sequence which can be targeted by a drug (for instance, using CRISPR-cas9 [4]). They have reduced the problem to some number of candidate sequences but to further narrow the selection requires painstaking experiments. If they had a framework that could incorporate their prior knowledge of DNA sequence similarity as well as the results from previous experiments, they could optimally select the best next experiment to perform, thereby saving vast amounts of time and resources. Such a framework would need to perform well under various data sizes as well as provide calibrated uncertainty estimates in order to make an informed decision.
Many problems, like this one, are discrete, involve large datasets, and require well-calibrated uncertainty estimates. Kernel methods have shown performances that are competitive with deep learning models in such application domains [5], while probabilistic modeling provides a unified framework for prediction and calibrated uncertainty estimates [6]. One class of probabilistic kernel methods that have proven to be useful in various regression and classification settings are Gaussian Processes (GPs) [7]. They are data efficient, non-parametric, and have tractable posterior distributions. Moreover, one can use any kind of likelihood for the generating process, for example, a Bernoulli likelihood in the case of a classification problem.
The main challenge of scaling GPs to large datasets lies in the computational complexity of inference which is cubic in the number of observations. Inducing point methods are the main class of approaches for circumventing this limitation [8]- [11]. These methods aim to use some m n inducing points to reduce the inference complexity to O(nm 2 ).
Having reduced the computational complexity of inference, the remaining challenge is to choose the set of inducing points that best approximates the full model [8]. When the domain is continuous, the locations of the inducing points can be optimized using the gradient of the log marginal likelihood [12]. Unfortunately, this gradient-based optimization scheme is not feasible in discrete domains, where the log marginal likelihood is no longer differentiable with respect to the Inducing points for the sparse Gaussian Process are chosen from the data points, but also from the rest of the domain. The choice of inducing points is optimized with respect to the log marginal likelihood. A discrete kernel function is chosen to construct the sparse approximation of the GP's covariance matrix. The GP can then be used to predict latent function values and uncertainties on the input domain.
inducing point locations.
In this work, we explore different techniques for choosing inducing points over discrete domains by combining discrete optimization with sparse GP approximations. In our experiments, we show that our sparse GP framework has comparable performance to full (i.e., not sparsified) GP models as well as Support Vector Machines on biological sequence data.
We make the following contributions: • We present the first empirical assessment of a range of different inducing point selection techniques for Gaussian Processes on discrete domains. • We discuss and evaluate the tradeoffs of these different techniques, for example, in terms of computational complexity and their ability to choose inducing points from outside the training set. • We assess the performance of the models on synthetic data and several challenging real-world datasets.
In the following sections, we describe the main components of our framework beginning with sparse GPs, continuing to discrete inducing point selection methods, and concluding with the spectrum string kernel. Each inducing point method corresponds to a different sparse string GP in our framework. For a high level overview of the framework see Figure 1. Finally, we present experiments comparing each inducing point selection technique in our framework using both Gaussian and Bernoulli likelihoods, that is, binary labels, on synthetic toy data, UCI splicing data [13], and the DREAM5 dataset [14].

II. SPARSE GAUSSIAN PROCESS APPROXIMATIONS
Consider a supervised learning problem in which the goal is to estimate a latent function f : X → R given observed inputs x := (x 1 , . . . , x n ) and corresponding outputs y := (y 1 , . . . , y n ). For the biologist example in the previous section, f could map DNA sequences to drug targetability scores. We assume that our observations are corrupted by additive noise η, thus y = f (x) + η, where we have overloaded the notation of the function to be broadcast elementwise. Following a long line of previous work [7], we treat the function f as an unobserved random variable with a Gaussian Process prior. Specifically, a GP prior with a zero mean function and a covariance kernel k(·, ·): It follows that the prior on the function outputs, f := f (x), is given by N (0, K xx ) where K xx denotes the Gram matrix (also known as the kernel matrix) with In general, the predictive distribution cannot be solved in closed form but in the special case of Gaussian noise, where η ∼ N (0, σ 2 ), the predictive distribution can be computed as where the test inputs and outputs are denoted x * and f * respectively and K * · = K · * is shorthand for K x * · . While this closed-form predictive distribution is appealing and has found numerous applications, scaling it to large datasets is fundamentally limited by the matrix inversion This motivates the use of so-called "inducing point" methods which provide a framework for trading model quality for tractability. We assume that there exists a set of m inducing points (z 1 , . . . , z m ) =: z, z i ∈ X with outputs u := f (z) which are distributed N (0, K zz ) according to the prior. Now, we make the modeling assumption that f and f * are conditionally independent given u, that is, p(f , f * , u) = p(f | u) p(f * | u) p(u). We can again solve the inference problem in closed form: Note that we have reduced the cubic part of inference from O(n 3 ) to O(m 3 ), where we can choose m, the number of inducing points. Overall, the inference procedure has complexity O(nm 2 ) [8]- [11]. Inducing point methods provide a framework for dramatically decreasing the computational complexity of inference, but we are still left with the problem of choosing the set of inducing points that achieves the best possible approximation with limited resources (namely m inducing points). This inducing point selection can be cast as an optimization problem in which we are trying to maximize the log marginal likelihood: with respect to the locations z. Standard methods for solving the inducing point selection problem focus on continuous inputs and overlook the case of discrete ones. In the following section, we tackle this problem on discrete domains using effective and well-tested discrete optimization methods.

III. DISCRETE OPTIMIZATION TECHNIQUES
The problem we are trying to solve using discrete optimization is arg max z log p(y | z) (Eq. 1). In the following sections, we approach this problem using two classical techniques from discrete optimization and one submodular data summarization model: greedy selection, simulated annealing, and determinantal point processes.

A. GREEDY SELECTION
Greedy inducing point selection dates back to early works on sparse Gaussian Processes [9], [15], [16]. The algorithm is initialized with an empty set of inducing points and at each iteration greedily selects the next observation in the data that maximizes the marginal likelihood p(y | z) (Eq. 1). Thus, the set of inducing points is a mere subset of the original data. This approach is justified by the fact that the marginal likelihood is strictly monotonic in the number of inducing points. Hence, adding a new inducing point is always guaranteed to increase the objective. This technique is conceptually simple and easy to implement, but it comes with the major drawback that the inducing points can only be selected from the training set. Especially in high-dimensional discrete spaces, the training set might only span a small fraction of the total space, so this can be a strong limitation.
Natural extensions of this method include selecting several inducing points instead of just one at every iteration, swapping inducing points between the training set and the inducing point set [17], and optimizing a variational lower bound on the likelihood rather than the likelihood itself [9]. Since we mainly include this method as a baseline in our experiments, we leave the exploration of these extensions to future work.

B. SIMULATED ANNEALING
Simulated annealing is a sampling-based approach which starts with an initial guess S 0 := {z 1 , . . . , z m } and a loss function L(·) to be optimized. At each iteration the algorithm perturbs an element of the set and decides whether or not to accept this new perturbation as the next state. To make this decision, an energy term is computed from the current iterate S t−1 and the proposalŜ, E t = L(S t−1 ) − L(Ŝ). The new proposal is then accepted with probability where T t is known as the temperature parameter and is usually chosen with an exponential decay rate in t.
Since we are working on discrete string domains, we define a perturbation to be a change of one or more characters in a given string. Determining the number of characters to change requires careful fine tuning. In our experiments we chose the most conservative setting of perturbing just a single character at each step. The loss function is again naturally defined as L(z) := log p(y | z).
Crucially, this perturbation approach allows the simulated annealing to explore inducing points from the entire input space, and not only from the training set. This makes it more flexible than greedy selection or the determinantal point process described in the following.

C. DETERMINANTAL POINT PROCESSES
A Determinantal Point Process (DPP) with kernel k is a distribution over subsets of observations [18]. Given a subset of points z ⊆ x with |z| = m as above, its probability is defined as 1 where I is the identity matrix. Intuitively, the determinant of the Gram matrix K zz represents the volume of the parallelepiped spanned by the features in z. Therefore, subsets of high probability have a large volume in feature space which in turn implies diversity. In their recent work, [19] showed analytically that with O(log N ) inducing points sampled from a DPP, the sparse GP is close to the full GP in KL-divergence. While this result only holds for squared-exponential kernels, we consider it to be a theoretical motivation for using DPPs for inducing point sampling.
While the normalization constant, det(K xx + I), is notably available in closed form, this is not relevant for our purposes since it still requires O(n 3 ) operations to compute. Instead, we use fast MCMC-based sampling methods for our DPPs [20]. Note that this could in principle be extended to sampling from the whole input space, but it would make the normalization constant intractable and require further approximations. For simplicity, we thus resort to just sampling inducing points from the training set in this work (similar to the greedy approach above).

IV. STRING KERNELS
While our framework is fully general, in this work we focus our experiments on biological sequences, which are an important real-world discrete data domain. One key aspect of many biological sequences is translation invariance. In this section, we describe n-gram-based string kernels which are designed to exhibit this property and thus explicitly incorporate our biological prior knowledge. In cases where full translationinvariance is not a desired property, a practitioner can choose from the vast literature on kernel methods to select a more appropriate prior for the GP.
Specifically, in our work we use the spectrum kernel [21], which was designed for protein sequences and has also been successfully applied to other types of biological sequences [22]. There are existing applications which use string kernels in Gaussian Processes, but they use small datasets (n ≈ 280) where full Gaussian Process inference is viable [23]. In this work, we enable the extension of these methods to larger data sets, through the use of sparse GP approximations.
Given an alphabet A, we denote the input domain of all strings of finite length as X = A * . The n-th order spectrum kernel is defined over this domain as is the number of times that the string a ∈ A n appears as a substring in x. This is essentially a bag-of-ngrams model.
While the set A k might be prohibitively large, thus making the feature maps Φ n (x) prohibitively high dimensional, it can easily be seen that we can compute k n (·, ·) without having to represent Φ n (x) explicitly. For two strings of arbitrary length, x ∈ A l(x) and x ∈ A l(x ) , the kernel can be rewritten as Computing this kernel naïvely has complexity O(l 2 ) where without loss of generality l := l(x) ≥ l(x ). This can be further improved using suffix trees, resulting in a complexity of O(kl) with k < l [21].
These three components-a GP prior represented by the choice of the kernel function, an inducing point GP approximation, and finally a method for selecting inducing points from a discrete input space-provide a unified framework for supervised learning over discrete input spaces using GPs. This framework not only has good predictive performance but also provides superior uncertainty estimates which we demonstrate in the following experiments. The main challenge within this framework remains the selection of inducing points from the discrete domains, which is why we will thoroughly assess the different techniques and their tradeoffs in the following.

V. EXPERIMENTS
We compared different inducing point optimization methods for sparse GPs on toy datasets in regression and classification settings. We then validated our framework's performance on two real-world DNA sequence datasets from the UCI repository [24] and the DREAM5 dataset [14]. We used support vector machines (SVMs) [25] with post hoc uncertainty calibration [26] as a competitive benchmark method for the classification tasks.
We find that our sparse string GP framework performs well when compared to full string GPs in these diverse settings. Moreover, the inducing points selected by the algorithm align well with the natural intuition for inducing points on continuous domains. Our framework offers comparable predictive performance with SVMs but yields superior uncertainty calibration.

A. IMPLEMENTATION
If not otherwise noted, all GPs and SVMs use a spectrum kernel as implemented in Shogun [27], [28]. For fitting the GPs, we used the GPy framework [10]. For fitting the SVMs, we used the sklearn package [29].
Note that in the regression experiments, the GP posterior can be computed in closed form (as described in Sec II), since the Gaussian likelihood is conjugate with the Gaussian prior. However, in the classification examples, we need to use a non-conjugate Bernoulli likelihood, and thus have to resort to approximate inference. In our experiments, we use expectation propagation for the approximate inference [7], [30], since it is fast and yields good performance. We use it with the GPy standard parameters. Alternatively, one could use Markov Chain Monte Carlo (MCMC) inference to get an even better (that is, asymptotically exact) approximation to the posterior. However, this would be computationally much more expensive and would likely outweigh all the computational benefits of our sparse approximation, which is why we have not tried this approach.
Since the SVM does not natively output probabilities, we have to calibrate it in order to turn the SVM predictions into probabilities. The Calibrated SVM uses a technique called Platt scaling [26]. It performs a logistic regression on the SVM outputs and calibrates it using a cross-validation on the training data.

B. PERFORMANCE EVALUATION
In order to assess the predictive performance of our GP models on regression and classification tasks, we use the mean squared error (MSE) and area under the precision-recall curve (AUPRC), respectively. We use calibration curves [31], [32] to assess uncertainty calibration. Moreover, we report the mean absolute deviation (AD) of the calibration curves from the diagonal. Note that a perfectly calibrated classifier would lie directly on the diagonal of the plot and hence yield an AD of zero.

C. INDUCING POINT OPTIMIZATION FOR REGRESSION AND CLASSIFICATION
We developed a simple toy experiment as a controlled setting for our initial comparisons. It is composed of 1000 Table 1: Performance comparison of different inducing point optimization methods and a full GP on a toy data regression task. Means and their standard errors are computed over 100 runs of the experiment. Note that the full GP is included as a gold standard, but is not considered in the actual comparison because it is not a scalable method.   strings of a 4 character alphabet (inspired by the DNA bases, 'A','C','T','G') each of which has length 100. We first generated a set of 100 strings of length 5 which we call the library. Each example in the toy dataset contains some copies of a particular element of the library, distributed uniformly at random through the sequence. The other characters are selected uniformly at random from the alphabet. The label for each example is the number of elements of the particular library sequence it contains. We think of this as a discrete dataset with 100 clusters of strings, corresponding to the elements of the library.
This toy dataset challenges the inducing point methods to effectively summarize the data for the prediction task. We use 100 inducing points for all the experiments. While in general one should perform Bayesian model selection via log marginal likelihood [7], to select the optimal kernel hyperparameter k, in this case we know the optimal value should be k = 5, since this is the size of the strings in the library.
In Tables 1 and 2, we compare a full GP model, sparse GPs with different inducing point selection methods, and SVMs. For the inducing point methods we used randomly chosen inducing points (random), greedy selection (greedy), VOLUME 4, 2016 Table 3: Comparison of training and inference time complexity for full GPs and the different sparse GPs. n is the number of training points, m the number of inducing points, s is the subset size for the greedy subset selection, and k the number of iterations for the simulated annealing.

Method
Training time Inference time simulated annealing (SA), and DPP (DPP) methods. For the greedy selection, we experimented with performing it on random subsets to improve performance. Thus, greedy_10percent corresponds to performing a greedy selection on a uniformly random selection of 10% of the training data. For the DPP, we experimented with different MCMC steps. Thus dpp_10percent corresponds to taking 10% of recommended nm steps until complete mixing.
Our results confirm the intuition that the sparse GP approximations cannot match the performance of the full GP, neither with respect to log-likelihood nor with respect to predictive performance. However, our results also demonstrate that inducing point selection is crucial in improving the performance of the sparse GPs. With careful selection of inducing points-either greedily or via simulated annealingthe sparse models can approach the performance of the full model.
Moreover, there is a clear tradeoff between runtime and performance of the methods. The methods with the longest runtime, particularly the simulated annealing, achieve the best results among the sparse GP models, while for instance the DPP model offers a much more attractive runtime, but slightly lower performance. It depends therefore on the practitioner's choice whether runtime or performance is a more important selection criterion. This also suggests that the ability of the simulated annealing to select inducing points from outside the training set can be beneficial in this application.
The inducing points chosen in both regression and classification settings follow a natural intuition. In the regression task, the model has to count the number of library elements equally well across all parts of the space. In the classification task, a more precise count close to the decision boundary is crucial for minimizing classification errors. Figure 2 clearly demonstrates this behavior.
This experiment shows that our sparse GP framework approaches the predictive performance of a full GP, while also outperforming baseline SVM methods. We also find that inducing point selection in discrete string space follows our general intuition for inducing point selection in continuous spaces.
There is a natural tradeoff between the fast but suboptimal performance of random inducing point selection and the slow but superior performance of greedy selection. We explored this tradeoff by restricting the greedy inducing point selection over the entire dataset to a randomly sampled subset of the data. We then varied the size of this random subset. In the case when the random subset is the same size as the original dataset, one recovers the greedy algorithm. See Table 3 for a summary of the time complexities of all methods discussed. The results are depicted in Figure 3. There is a clear tradeoff between performance and runtime, both of which increase as subset size increases. However, the runtime grows linearly with the subset size (as expected), while the likelihood converges. We expect that finding the optimal subset size will highly depend on the application both in terms of specific properties of the dataset as well as the computational resources available to the practitioner.

D. REAL WORLD DNA SEQUENCE DATA
To validate our models on real-world data, we performed classification on the UCI splicing dataset [13]. Moreover, to demonstrate the scalability of these methods, we performed regression on the DREAM5 dataset [14]. As with our experiments on synthetic data, we aim to compare the predictive performance and uncertainty calibration against SVMs for classification. The splicing dataset contains 3,190 sequences of 60 nucleotides each which have to be classified into splicing and non-splicing sites. We applied our methods to the pTH2427 transcription factor of the DREAM5 dataset which contains a total of 32,896 short sequences of length 8. On the DREAM5 dataset, we performed a 70-30% train-test split.
We compared a kernel SVM against a full GP and our sparse GPs with inducing points selected greedily and by simulated annealing. Note that the splicing data as well as the DREAM5 data are too large for feasible full GP inference. To provide a fair comparison, we report inference times of our sparse GP in comparison with the full GP on a randomly selected subset Table 4: Comparison between greedy and DPP methods on the DREAM5 dataset. The kernel matrix for the DPP was normalized as k(x, y)/ k(x, x)k(y, y). The greedy method was run on uniform random subsets of size 20 of the training data.   [25] Full GP [23], [33] Variational GP [34] Greedy GP [17] DPP-GP [19] SA-GP (ours)   of the splicing data in Table 6. It can be seen that our sparse GP speeds up inference by more than one order of magnitude while still yielding comparable predictive performance (c.f. Tab. 3). The sparse GPs use 50 inducing points on the splicing data. The order of the spectrum kernel was chosen to be k = 3 by all GPs through log marginal likelihood optimization.
The performance of the methods in terms of area under the precision-recall curve (AUPRC) and calibration is measured by a 2000/1190 train-test-split on the splicing data. Results are reported in Figure 4.
For the DREAM5 dataset (Tab. 4) we used a kernel of size k = 3. This was motivated both by biological considerations of the size of a codon and also by the fact that the sequences are short-only 8 characters. We found that when set to the right random subset size (in this case 20), greedy selection could outperform the DPP in both runtime and performance.
If we compare the calibration of the different methods on the classification task, it can be seen that the various sparse GP models, and the calibrated SVM are comparable in terms of calibration and predictive performance. (Fig. 4). The calibration ranking among the GPs is analogous to the one for the log-likelihoods, that is, the sparse GP with inducing points optimized by simulated annealing ranks second, the one with greedily selected points third.
These experiments show that our framework yields a comparable performance with full GP inference as well as kernel SVMs on real world DNA sequence classification tasks. Moreover, it scales to larger datasets where full GP inference is computationally infeasible. It also shows that greedy selection can perform better than the theoretically motivated DPP sampling, while simulated annealing generally performs best, possibly due to its more flexible inducing point selection from outside the training set.

VI. RELATED WORK a: Sparse Gaussian processes
This work builds upon the rich literature on inducing point methods for Gaussian Processes (see [8] and references therein). Recent work in this domain has utilized variational approximations [34] and certain geometrical structures [11]. Furthermore, it has been proposed to use spherical harmonic features [35], orthogonal inducing points [36], and doubly sparse GPs [37]. Unfortunately, all these advances are limited to continuous input spaces, so we are forced to resort to more conventional inducing point methods in this work. VOLUME 4, 2016 b: Gaussian processes on discrete domains Many kernels have been devised to work well on discrete domains, for instance, on strings [21] or graphs [38]. These have been used successfully in combination with SVMs or similar linear models for problems in biology [5], [22], chemistry [39], [40], and natural language processing [41]. Using discrete kernels in GPs is a relatively unexplored area, possibly due to the difficulties in hyper-parameter optimization and inducing point selection. Discrete kernels have been used on graphs [42] and strings [43] (also for biological problems [23]), but so far only on relatively small datasets with full GPs. In parallel work, [33] study GPs on strings for Bayesian optimization, but their problems are small they do not use inducing points or any other scalability approach.
c: Discrete inducing point selection While the greedy inducing point selection approach has already been proposed in early work on GP regression [16], [44], [45], these works have not particularly assessed its performance on discrete GPs. To the best of our knowledge, the only work that previously studied discrete sparse GPs are [17], who also use a greedy approach, although with greedy swapping between the inducing point set and the training set, instead of greedy forward selection. We are the first to additionally study DPPs for discrete inducing point selection, as well as simulated annealing, thus enabling the use of inducing points that are not included in the training set. For an overview comparison of our proposed simulated annealing approach with the other related models in terms of uncertainty estimation, scalability, applicability to discrete domains, and selection of the inducing points, see Table 5.

VII. CONCLUSION AND FUTURE WORK
In this work, we explore different inducing point selection techniques for sparse Gaussian Processes on discrete domains. We found empirically that our proposed method using simulated annealing gives the best overall predictive performance and uncertainty estimates. This method also yields favorable runtimes on larger datasets and more non-standard likelihoods.
Moreover, we showed that our models perform competitively with SVMs on toy data as well as real-world DNA sequence data in terms of predictive performance, while offering better calibrated uncertainty estimates in some settings.
There are many directions for future work. First, developing a closer integration between discrete optimization and the marginal likelihood of the GP would improve both the approximation quality as well as the runtime of the inducing point algorithm. An orthogonal direction is a fully Bayesian treatment of the string kernel hyperparameter k, namely treating k as a random variable with a prior and performing inference on it. Finally, we would like to see existing GP software packages extend their abstractions to discrete kernels with hyperparameters that are not differentiable.
In conclusion, we would advise practitioners to use our framework on discrete problems where datasets are too large for full Gaussian Process models but uncertainty estimates are still desirable. Furthermore, in cases where likelihoods other than Gaussian or Bernoulli are required, standard regression and classification techniques are inapplicable, whereas our framework provides a principled and flexible solution.