Budgeted Passive-Aggressive Learning for Online Multiclass Classification

Online multiclass classification is a specific problem of online learning that performs a sequence of multiclass classification tasks given the knowledge of previous tasks. The goal is to make correct predictions for this sequence. It is generally considered a more complicated problem than its binary counterpart, online binary classification. A popular algorithm, called the passive-aggressive algorithm, was originally proposed for binary problems and later extended as the multiclass passive-aggressive (MPA) algorithm for multiclass problems. The nature of MPA allows it to implement the kernel trick, which enables better predictions with a kernel-based model. However, this approach suffers from the curse of kernelization, which causes unbounded growth of the model in memory usage and runtime. To solve the growth problem, we first introduce a resource perspective that gives an alternative and equivalent interpretation of the kernel-based MPA algorithm. Based on the resource perspective, we propose the budgeted MPA (BMPA) algorithm, which approximates the kernel-based MPA algorithm. BMPA limits the maximum number of available resources by removal and fully exploits them through a constrained optimization. We study three removal strategies and give a relative mistake bound that provides a unified analysis. Simulation experiments on various datasets are conducted to demonstrate that BMPA is effective and competitive with state-of-the-art budgeted online algorithms.


I. INTRODUCTION
Online learning aims to solve a sequence of prediction tasks given knowledge of the correct targets of previous tasks [1]-[3]. On each task round, a prediction is made for the received instance, and the prediction model is then updated based on the correct target received afterward so as to improve predictions for future tasks. The goal of online learning is to make accurate predictions for the sequence; as long as the prediction model can adapt to the sequence, it does not matter whether the model converges. Because of its adaptive nature, online learning is suited to many practical applications that receive streaming data, such as real-time malicious URL detection [4] and ad click-through rate prediction [5].
Numerous online algorithms have been proposed for the online setting [6]-[11]. Most of these methods focus on the design of the update rule with a linear model. However, this simple linearity may limit the prediction performance. For a complex problem, a linear model may need to be paired with superior feature extraction to achieve good prediction performance.
Fortunately, the prevalence of support vector machines inspires the application of kernels to online learning [12]. Since the nature of many online algorithms allows them to be kernelized easily, e.g., the Perceptron algorithm [13] and the online gradient descent (OGD) algorithm [14], a kernel-based model, which is a nonlinear model composed of kernel functions, can usually be used to achieve better prediction performance. This kernel-based approach is known as the kernel trick, which replaces all the inner products in an algorithm with kernel functions. While the kernel trick is simple yet effective, a kernelized online algorithm needs to store support vectors and the associated combination weights together to represent a kernel-based model. It turns out that the kernelized online algorithm suffers from the curse of kernelization, which results in unbounded growth in memory usage and runtime per task round as more and more tasks are done [15]. This may cause a kernelized online algorithm to break down on resource-insufficient occasions, e.g., when making predictions on smartphones with limited computational power.
Many researchers have tried to address this issue in binary cases that deal with a sequence of binary classification tasks. Most existing works are focused on controlling the growth of a kernel-based model by restricting the number of support vectors [13], [14], [16]- [20]. Some works transform a kernel-based model into a linear model with kernel-induced feature approximation and thus avoid the growth [21], [22]. The same issue exists in multiclass cases that face a sequence of multiclass classification tasks. Although a multiclass problem is generally more complicated than a binary problem, several research works have attempted to cure the curse by controlling the growth of a model [13], [23], or using kernel-induced feature approximation [21], [22].
Among state-of-the-art online methods, a family of margin-based online algorithms called passive-aggressive (PA) algorithms has drawn lots of attention in recent years [10]; the popularity is likely because it can be used to solve many kinds of problems, such as classification, regression, and structured prediction, and the formulation for the update of a prediction model is simple and neat. The PA algorithms have facilitated fruitful applications [4], [24]-[27] and have inspired many subsequent online algorithms [28]-[32]. A PA algorithm can be kernelized to make more accurate predictions with a nonlinear model, but it still suffers from the curse of kernelization, which makes it difficult to deploy in applications with limited computational power. However, there exist only a few research works focused on overcoming the curse for PA algorithms applied to binary classification [20], [33]. It is worth studying how to overcome this issue for PA algorithms in various types of problems, such as multiclass classification and structured prediction, so that kernel-based models can be applied safely in more and more practical applications. Moreover, for PA-based online algorithms, this study may shed some light on how to overcome their curse.
In this paper, we attempt to break the curse of kernelization for the multiclass classification version of the PA algorithm, which is referred to as the multiclass PA (MPA) algorithm in the rest of this paper. There are two main challenges of breaking the curse for the MPA algorithm. First, a kernel-based prediction model for m-class classification (m > 2) consists of m kernel-based hypotheses instead of only one hypothesis for binary classification; thus, we should somehow simultaneously limit the growth of all hypotheses to control the growth of the model. Second, since controlling the growth of a model will result in some sacrifice in the prediction performance, it is necessary to diminish the information loss in the updated model to maintain the performance.
To tackle these challenges, we propose a new budgeted method called the budgeted multiclass passive-aggressive (BMPA) algorithm to control and update all the hypotheses of a model at the same time. Concretely, we make the following contributions in this paper.
1) To provide a solid explanation of the proposed BMPA algorithm, we introduce the resource perspective, which treats every encountered instance as a potential resource and the kernel-based MPA algorithm as a manager exploiting available resources to simultaneously construct all hypotheses of a prediction model. It gives an alternative and equivalent interpretation of the kernel-based MPA algorithm.
2) Through the resource perspective, we propose the BMPA algorithm, which exploits only a finite number of available resources to approximate the kernel-based MPA algorithm. Specifically, the BMPA algorithm employs a projection approach to diminish the information loss in the updated model when there exists any unaffordable resource.
3) We study three kinds of budget maintenance strategies concerning how to select the unaffordable resource to remove, and suggest using the smallest removal strategy, which removes the resource with the smallest magnitude of weights. The smallest removal strategy achieves a good tradeoff between prediction performance and runtime.
4) We justify the proposed BMPA algorithm and the budget maintenance strategies by providing a unified relative mistake bound and conducting comprehensive empirical experiments on eight open datasets.
Deep learning (DL)-based classification methods, e.g., AlexNet [34], GoogLeNet [35], and ResNet [36], are typically trained by backpropagation in a batch learning setting, which requires the entire training data to be collected before the learning task; batch learning is therefore also called offline learning. Machine learning (ML)-based classification methods, e.g., multiclass logistic regression [37], multiclass Gaussian process classification [38], and multiclass SVMs [39], are also typically trained in a batch learning setting. These DL-based and ML-based methods aim to learn a fixed classification model that achieves good generalization on new testing data. On the other hand, the kernel-based MPA algorithm investigated in this study deals with a sequence of online and real-time prediction tasks. Its goal is to make accurate predictions for the sequence of input data, and the real-time prediction model can change dynamically; as long as the prediction model can adapt to the input sequence, either a fixed or a changing model is acceptable. Because the kernel-based MPA algorithm and DL-based (or ML-based) methods are developed for different prediction problem settings, we focus on comparing the proposed BMPA algorithm with budgeted and non-budgeted online multiclass algorithms in this paper.
For practical applications receiving streaming data, the targets are often assumed to be generated from a specific target function or distribution. These applications may face the problem of concept drift, in which the target function or distribution changes over time [40], [41]. On this subject, methods have been proposed in the literature to address adaptation to concept drift over time, e.g., adaptive random forests [42], the Kappa updated ensemble [43], and leveraging bagging [44]. Moreover, with the unlearning framework [45], the PA algorithm can be applied to these applications to prevent degradation in prediction performance; similarly, the MPA algorithm can be applied for multiclass classification. In this paper, we focus on the development of a budgeted online algorithm that makes the MPA algorithm feasible under limited computational power.
The rest of the paper is organized as follows. We discuss related work in Section II. Section III reviews the learning setting for the MPA algorithm and its kernelization. In Section IV, we present the proposed BMPA algorithm and study three budget maintenance strategies. Section V provides a unified theoretical analysis for the proposed method. We conduct empirical experiments in Section VI and conclude the paper in Section VII.

II. RELATED WORK
In this section, we mainly discuss online classification problems in online learning. For a more comprehensive survey of online learning, please refer to [46].

A. ONLINE LEARNING
Online classification can be further divided into two cases: online binary classification for a sequence of binary classification tasks and online multiclass classification for a sequence of multiclass classification tasks. Since the latter is generally considered a more complicated problem, early research focused primarily on the former.
The well-known Perceptron algorithm is probably the very first online algorithm for the binary case [6]. It performs a simple additive update on the parameters of a linear model whenever it makes a wrong prediction. Several theoretical studies suggest that the number of prediction mistakes made by the Perceptron algorithm can be bounded from above [8], [47], [48]. Crammer et al. [10] proposed the passive-aggressive (PA) algorithm that utilizes the notion of margin to update the linear model. If the margin of the current example is smaller than a predefined value, the PA algorithm updates the model so that the new model achieves a unit margin on the current example by solving a constrained optimization problem. Two variants of the PA algorithm, which relax the constraints by trading off between the model change and the desired margin, are also described so that the existence of noise can be taken into consideration, e.g., mislabeled examples. Relative loss bounds for all three variants of the PA algorithm are derived. Due to the prevalence of the PA algorithms, several subsequent works are conducted. Confidence-weighted (CW) learning brings uncertainty into the linear model [25], [28], [49]. The parameter confidence is modeled as a Gaussian distribution, and the CW algorithm updates the distribution by solving a constrained optimization problem mimicking the PA algorithm. Since the CW algorithm performs poorly on nonseparable data and noisy data due to its aggressive update rules, soft confidence-weighted (SCW) learning is proposed to alleviate the situation by trading off between the distribution change and the adaptive margin [29], [50]. Alternatively, adaptive regularization of weights (AROW) employs an adaptive regularization approach for each example according to its confidence [30], [51].
Other researchers turned their attention to online multiclass classification. Much effort went into cleverly adapting online algorithms originally designed for binary problems to handle multiclass problems. For example, Crammer and Singer [52] proposed a family of additive ultraconservative algorithms, which generalize the Perceptron algorithm to multiclass problems, and provided unified mistake bounds. Similarly, multiclass extensions of the PA, CW, AROW, and SCW algorithms were developed one after another [10], [31], [50], [51]. Based on the multiclass PA algorithm, Matsushima et al. [32] further proposed the support class PA algorithm, which resolves the constraint relaxation through the idea of support classes.

B. KERNEL-BASED ONLINE LEARNING
With the kernel trick, an online algorithm can employ a kernel-based model, which usually achieves better prediction accuracy. However, it then suffers from the curse of kernelization, which causes unbounded growth in memory usage and runtime [15]; thus, several researchers have tried to address this issue.
For binary problems, most works focus on controlling the growth of a kernel-based model by limiting the number of support vectors (SVs). The budget Perceptron algorithm proposed by Crammer et al. [16] is the first algorithm to limit the number of SVs by a predefined value, which is called the budget. Once the number of SVs reaches the budget, it selects one of the SVs that meets some rule and replaces it with the new instance. A similar strategy is used in the NORMA algorithm [14] and the tighter budget Perceptron algorithm [17]. Dekel et al. [18] proposed the Forgetron algorithm, the first algorithm with a relative mistake bound derived on a budget. The key ingredient making the theoretical analysis possible is the repeated shrinking of the kernel-based model followed by removing the oldest SV every time an update is performed. Later, Cavallanti et al. [19] proposed the randomized budget Perceptron algorithm, which chooses an SV to remove at random and enjoys an expected mistake bound similar to that of the Forgetron algorithm. A merging approach was proposed to combine two SVs into a new one [53]. Wang and Vucetic [20] proposed the budgeted PA (BPA) algorithm, which performs the PA algorithm on a fixed budget through a constrained optimization problem. Based on the kernel-based Perceptron algorithm, Orabona et al. [13] proposed the Projectron algorithm, which takes a different route to control the growth of a kernel-based model. It either includes the current instance as an SV or projects the kernel of the instance onto the subspace spanned by kernels centered on SVs if the projection error is small enough. The number of SVs is guaranteed to be bounded yet unknown in advance, and a relative mistake bound is derived. The authors also proposed an improved algorithm called Projectron++ that considers the notion of margin. Instead of focusing on the number of SVs, Lu et al. [22] proposed a new framework that turns a kernel-based model into a linear model with kernel-induced feature approximation. Under this framework, they proposed two algorithms with loss bounds based on the OGD algorithm. One is the Fourier OGD algorithm, which approximates shift-invariant kernels by using random Fourier features, and the other is the Nyström OGD algorithm, which approximates the kernel matrix by using the Nyström method.
In the literature, a few kernel-based methods have been proposed for multiclass problems, which are more complicated than binary problems. Orabona et al. [13] proposed a multiclass version of the Projectron++ algorithm and presented its relative mistake bound. However, the number of SVs cannot be known in advance, which may cause a breakdown when only a finite amount of computational resources is available. In [21] and [22], both the Fourier OGD and Nyström OGD algorithms were extended to multiclass problems. For the former, the resulting number of features must be large enough to approximate the shift-invariant kernels well; for the latter, the matrix approximation rank must be large enough to achieve good classification accuracy.
Finally, TABLE 1 summarizes the difference between our proposed BMPA algorithm and the BPA algorithm [20] that motivates our work in this paper. Firstly, BPA is designed to tackle binary problems while BMPA aims for multiclass problems. Secondly, BPA studies which subset of SVs is selected for representing a removed SV and adopts a fixed removal strategy that picks the SV minimizing a regularized loss; BMPA always selects the entire set of SVs for representing a removed SV and studies which SV is selected for removal. Thirdly, as shown in FIGURE 1, we introduce the resource perspective to give an equivalent interpretation of the kernel-based MPA algorithm and provide a solid explanation of how BMPA approximates the kernel-based MPA; however, BPA includes a budget constraint heuristically to limit the maximum number of SVs without any statement on the approximation. Lastly, BMPA enjoys a relative mistake bound while there is no theoretical support for BPA.

III. PROBLEM SETTING
We consider the problem of online multiclass classification, which is to solve a sequence of multiclass classification tasks given the knowledge of the correct class labels of previous tasks. On the t-th task round, the learner first receives an instance $x_t \in \mathcal{X}$ and predicts its class label $\hat{y}_t \in \mathcal{Y} = \{1, 2, \ldots, m\}$, $m > 2$, based on some prediction rules. After receiving the correct class label $y_t$, the learner decides whether to update the prediction rules so that future tasks may be handled well, without any knowledge of those future tasks. The goal is to make as many correct predictions as possible for this sequence. In the rest of this section, we review the multiclass passive-aggressive (MPA) algorithm and its kernelization. The kernel-based MPA algorithm will serve as the basis for designing a budgeted algorithm that can be used on resource-insufficient occasions, e.g., performing online multiclass classification on smart devices with limited computational power.
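To make the protocol concrete, the following sketch expresses the round structure above in Python; the `predict` and `update` methods and the `stream` iterator are illustrative names for this sketch, not part of the formal setting.

```python
def run_online(model, stream):
    """A minimal sketch of the online multiclass protocol: `stream` yields
    (x_t, y_t) pairs and `model` is any learner exposing predict/update."""
    mistakes = 0
    for x_t, y_t in stream:
        y_hat = model.predict(x_t)      # predict before the label is revealed
        mistakes += int(y_hat != y_t)   # count the 0/1 prediction mistake
        model.update(x_t, y_t)          # the learner may adjust its rules
    return mistakes
```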

A. MPA ALGORITHM
To tackle the problem of online multiclass classification, MPA employs an m-class discriminant comprising m linear hypotheses of the following inner product form [10]:

$f^s(x) = w^s \cdot \phi(x), \quad s \in \mathcal{Y},$

where $\mathcal{Y} = \{1, 2, \ldots, m\}$ is the set of m class labels and $\phi : \mathcal{X} \to \mathbb{R}^d$ represents a feature extraction that transforms instances to a desired feature space. $f^s : \mathcal{X} \to \mathbb{R}$ measures the score of class s for an instance and is parameterized by the weight vector $w^s \in \mathbb{R}^d$. The process of MPA for a task round can be summarized in four steps: scoring, prediction, evaluation, and update.
On round t, MPA first computes the scores of all classes after receiving the instance $x_t$, where $f_t^s(\cdot) = w_t^s \cdot \phi(\cdot)$ measures the score of class s at round t. Then, MPA predicts the class with the highest score,

$\hat{y}_t = \arg\max_{s \in \mathcal{Y}} f_t^s(x_t)$. (3)

After receiving the true class label $y_t$, the corresponding prediction mistake can be determined by $I(\hat{y}_t \neq y_t)$, where $I(\cdot)$ is an indicator function. Another popular way to evaluate the prediction is the hinge loss function, defined as

$\ell(\{f^s\}_{s=1}^m; (x, y)) = \max\big\{0,\ 1 - \big(f^y(x) - \max_{s \in \mathcal{Y}, s \neq y} f^s(x)\big)\big\}$, (4)

which penalizes the prediction if the margin of the discriminant, $f^y(x) - \max_{s \in \mathcal{Y}, s \neq y} f^s(x)$, is less than 1. At round t, MPA evaluates the prediction of the discriminant parameterized with $\{w_t^s\}_{s=1}^m$ by computing the hinge loss $\ell_t = \ell(\{f_t^s\}_{s=1}^m; (x_t, y_t))$, where $\hat{s}_t = \arg\max_{s \in \mathcal{Y}, s \neq y_t} f_t^s(x_t)$ is the most misleading class on round t; here we slightly abuse the notation by writing $\ell_t$ instead of $\ell(\{f_t^s\}_{s=1}^m; (x_t, y_t))$. At the end of round t, MPA solves the following constrained optimization problem for the updated discriminant parameterized by $\{w_{t+1}^s\}_{s=1}^m$:

$\{w_{t+1}^s\}_{s=1}^m = \arg\min_{\{w^s\}} \frac{1}{2} \sum_{s=1}^m \|w^s - w_t^s\|^2$ subject to $w^{y_t} \cdot \phi(x_t) - w^{\hat{s}_t} \cdot \phi(x_t) \geq 1$, (7)

which requires the update to have minimum change while only having to achieve enough score difference between the correct class $y_t$ and the most misleading class $\hat{s}_t$ on the current example $(x_t, y_t)$.¹ Note that this optimization problem is a relaxation because it considers only the single incorrect class $\hat{s}_t$ instead of all incorrect classes [10]. According to (4), it does not guarantee that the updated discriminant has zero hinge loss on the current example. (7) has a closed-form solution, and the resulting update rule is

$w_{t+1}^s = w_t^s + (\delta_{s,y_t} - \delta_{s,\hat{s}_t})\, \tau_t\, \phi(x_t), \quad \forall s \in \mathcal{Y}$, (8a)
$\tau_t = \ell_t / \big(2\|\phi(x_t)\|^2\big)$, (8b)
$\delta_{a,b} = 1$ if $a = b$, and $0$ otherwise, (8c)

where $\delta_{a,b}$ is the Kronecker delta. This means only the weight vectors of classes $y_t$ and $\hat{s}_t$ are modified, by adding or subtracting the scaled feature vector $\tau_t \phi(x_t)$, if the prediction suffers nonzero hinge loss; otherwise all weight vectors remain unchanged. It should be noted that the update requires explicitly computing the feature vector $\phi(x_t)$, and thus the feature space is restricted to be finite-dimensional.

¹In this paper, we focus on the version without a slack variable.
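As an illustration, the four steps and the closed-form update can be sketched as follows, assuming instances are d-dimensional NumPy vectors and $\phi$ is the identity map; the class and method names are ours, not part of the original formulation.

```python
import numpy as np

class LinearMPA:
    """Sketch of one MPA round: scoring, prediction, evaluation, update."""
    def __init__(self, m, d):
        self.W = np.zeros((m, d))            # one weight vector w^s per class

    def predict(self, x):
        return int(np.argmax(self.W @ x))    # class with the highest score (3)

    def update(self, x, y):
        scores = self.W @ x
        masked = scores.copy()
        masked[y] = -np.inf
        s_hat = int(np.argmax(masked))       # most misleading class
        loss = max(0.0, 1.0 - (scores[y] - scores[s_hat]))   # hinge loss (4)
        if loss > 0 and (x @ x) > 0:
            tau = loss / (2.0 * (x @ x))     # closed-form step size (8b)
            self.W[y] += tau * x             # only classes y_t and s_hat move
            self.W[s_hat] -= tau * x
```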

B. KERNEL-BASED MPA ALGORITHM
Through the kernel trick, MPA can be kernelized to make predictions with a kernel-based m-class discriminant. To see this, we first represent the weight vector associated with class s on round t as a linear combination of feature vectors of support elements (SEs),

$w_t^s = \sum_{i \in I_t} \alpha_i^s\, \phi(x_i)$,

where $w_1^s$ is initialized as the zero vector for all classes and $I_t$ is the support index set, which collects the indices of SEs. In this paper, an instance $x_i \in \mathcal{X}$ used to construct a prediction model is called an SE instead of a support vector because we suppose that $x_i$ can be anything, e.g., a document of varied length, not just a fixed-length vector. In the literature, an SE is called a support vector because it is assumed to be a fixed-length vector.
The kernelization of MPA is to replace the inner product on the feature space, $\phi(x) \cdot \phi(x')$, by a kernel function $k(x, x')$ that implements the inner product implicitly, $k(x, x') = \phi(x) \cdot \phi(x')$. Thus, we can represent all linear hypotheses of the m-class discriminant on round t as kernel-based hypotheses simultaneously,

$f_t^s(\cdot) = \sum_{i \in I_t} \alpha_i^s\, k(x_i, \cdot), \quad s \in \mathcal{Y}$, (12)

which implicitly carries out the feature extraction through the kernel function k. Prediction of the class and evaluation of the hinge loss are computed by (3) and (4) using (12) and the current example $(x_t, y_t)$. The update rule of the kernel-based MPA is re-organized as follows (cf. (8a)-(8c)): if $\ell_t > 0$,

$\alpha_t^s = (\delta_{s,y_t} - \delta_{s,\hat{s}_t})\, \tau_t, \quad \forall s \in \mathcal{Y}$, with $\tau_t = \ell_t / (2 k(x_t, x_t))$, (13a)
$\alpha_i^s$ stays the same for all $i \in I_t$ and all $s \in \mathcal{Y}$, (13b)
$I_{t+1} = I_t \cup \{t\}$, (13c)

and otherwise all hypotheses remain unchanged. Since the kernel-based MPA has to store and use all SEs and the associated combination weights, the computational burden of a single round, in terms of memory usage and runtime, may grow unboundedly as more and more rounds are done. This problem is called the curse of kernelization and demands a resource-efficient algorithm if we need the kernel-based MPA to be workable in practical applications, especially on resource-insufficient occasions. To solve the curse of kernelization, we propose a budgeted algorithm that limits the number of SEs in use and fully exploits them through a constrained optimization.
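A sketch of the kernelized algorithm makes the growth problem visible: the SE and weight lists below gain one entry on every round with nonzero hinge loss, so memory and per-round runtime grow without bound. The class name and list-based storage are illustrative choices.

```python
import numpy as np

class KernelMPA:
    """Sketch of the kernel-based MPA, cf. (12) and (13a)-(13c)."""
    def __init__(self, m, kernel):
        self.m, self.kernel = m, kernel
        self.SEs = []                        # support elements x_i, i in I_t
        self.alpha = []                      # per-SE weight vectors of length m

    def scores(self, x):
        s = np.zeros(self.m)
        for xi, a in zip(self.SEs, self.alpha):
            s += a * self.kernel(xi, x)      # f_t^s(x) = sum_i alpha_i^s k(x_i, x)
        return s

    def update(self, x, y):
        sc = self.scores(x)
        masked = sc.copy()
        masked[y] = -np.inf
        s_hat = int(np.argmax(masked))
        loss = max(0.0, 1.0 - (sc[y] - sc[s_hat]))
        if loss > 0:
            tau = loss / (2.0 * self.kernel(x, x))
            a = np.zeros(self.m)
            a[y], a[s_hat] = tau, -tau       # (13a): weights of the new SE
            self.SEs.append(x)               # (13c): I_{t+1} = I_t U {t}
            self.alpha.append(a)             # (13b): existing weights unchanged
```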

IV. KERNEL-BASED MPA ON A FIXED BUDGET
In this section, we first introduce the resource perspective for the kernel-based MPA, and then propose the budgeted MPA (BMPA) algorithm based on this perspective. Finally, we study three SE removal strategies for the budget maintenance.
Since the m-class discriminant adopted by the kernel-based MPA consists of m kernel-based hypotheses, we consider that the hypotheses of an m-class discriminant are selected from the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ of a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ [54]. $\mathcal{H}$ is a Hilbert space of real-valued functions $f : \mathcal{X} \to \mathbb{R}$ endowed with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ such that (1) $k(x, \cdot) \in \mathcal{H}$, $\forall x \in \mathcal{X}$, and (2) the reproducing property holds, $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$, $\forall f \in \mathcal{H}$, $\forall x \in \mathcal{X}$. The inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ induces a norm on $\mathcal{H}$ such that $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$.

A. RESOURCE PERSPECTIVE
According to (12), the kernel-based MPA employs an m-class discriminant comprising m kernel-based hypotheses. All kernel-based hypotheses on round t are linear combinations of kernels centered on the same set of SEs corresponding to $I_t$. However, (13a)-(13c) show that the update of the kernel-based MPA includes the current instance $x_t$ as a new SE and only adjusts its combination weights; all weights corresponding to the other SEs stay the same. This raises the question of whether we can somehow adjust the weights of all available SEs, including $x_t$, and obtain an update rule different from that of the kernel-based MPA. We now introduce the resource perspective of a kernel-based online algorithm, which treats every encountered instance $x_i$ as a potential resource that may be included as an SE or removed in the online learning process. On round t, SEs are treated as available resources used to construct the prediction model, and the corresponding weights are treated as the degrees of utilization of the SEs. During the update step, the learner first selects which instances to store as available resources (i.e., SEs) for prediction on the next round and determines their degrees of utilization; instances that are not stored as SEs are treated as unavailable resources, which are removed and cannot be used on later rounds. The remaining question is whether we can get a different update rule by using the resource perspective and following the update idea of MPA, which asks for the minimum change in the prediction model while achieving enough score difference.
Let us start from the prediction model used by the resource perspective. Assume the hypotheses of the m-class discriminant on round t are some linear combinations of kernels centered on the same set of SEs,

$\tilde{f}_t^s(\cdot) = \sum_{i \in \tilde{I}_t} \tilde{\alpha}_i^s\, k(x_i, \cdot), \quad s \in \mathcal{Y}$, (14)

where $\tilde{I}_t$ is the support index set, which collects the corresponding indices of SEs. We use the symbol ∼ to emphasize that, even with the same sequence of tasks, (14) and (12) may differ in SEs and combination weights. Prediction of the class and evaluation of the hinge loss are computed by (3) and (4) using (14) and the current example $(x_t, y_t)$. If the hinge loss is zero, we keep the same hypotheses for the next round,

$\tilde{f}_{t+1}^s = \tilde{f}_t^s, \quad \forall s \in \mathcal{Y}$. (15)

In case the discriminant suffers nonzero hinge loss, i.e., $\ell(\{\tilde{f}_t^s\}_{s=1}^m; (x_t, y_t)) > 0$, without any constraint on the number of resources, the resource perspective suggests selecting all available instances including $x_t$ as available resources for prediction on the next round. In other words, we seek new hypotheses that are some linear combinations of kernels centered on the new set of SEs,

$\tilde{f}_{t+1}^s(\cdot) = \sum_{i \in \tilde{I}_{t+1}} a_i^s\, k(x_i, \cdot), \quad s \in \mathcal{Y}$, (16)

where $\tilde{I}_{t+1} = \tilde{I}_t \cup \{t\}$ is the new support index set. It remains to determine the degrees of utilization $a_i^s$'s. Then, we follow the update idea of MPA: the degrees of utilization for the next round should result in minimum change in the prediction model while achieving enough score difference on the current example $(x_t, y_t)$:

$\min_{a_i^s, \forall i, \forall s}\ \frac{1}{2} \sum_{s=1}^m \Big\| \sum_{i \in \tilde{I}_{t+1}} a_i^s\, k(x_i, \cdot) - \tilde{f}_t^s \Big\|_{\mathcal{H}}^2$ subject to $\tilde{f}_{t+1}^{y_t}(x_t) - \tilde{f}_{t+1}^{\hat{s}_t}(x_t) \geq 1$, (17)

where $\hat{s}_t = \arg\max_{s \in \mathcal{Y}, s \neq y_t} \tilde{f}_t^s(x_t)$ is the most misleading class determined by using (14) and $(x_t, y_t)$. Substituting the solution of (17) back into (16), we get the new hypotheses $\{\tilde{f}_{t+1}^s\}_{s=1}^m$ for the next round. Note that (17) is different from (7) because we solve for the degrees of utilization of available resources in (17) instead of the weight vectors in (7).
Somewhat surprisingly, the resource perspective combined with the update idea of MPA turns out to have the same update rule as the kernel-based MPA. Moreover, if we set the same initialization for both approaches, we will get the same prediction results for the entire sequence of tasks. We state this result formally in the following proposition.
Proposition 1: Suppose the resource perspective proceeds as follows: 1) initialize the hypotheses as $\tilde{f}_1^s = 0$ for all $s \in \mathcal{Y}$; 2) on each round t, predict the class and evaluate the hinge loss by (3) and (4) using (14) and the current example $(x_t, y_t)$; 3) update the hypotheses for round t + 1 by solving (17) when the hinge loss is positive, and otherwise update by (15). This setting follows the same prediction path made by the kernel-based MPA, that is, $\tilde{f}_t^s = f_t^s$ for all $s \in \mathcal{Y}$ and $t = 1, 2, \ldots$, where $\{f_t^s\}_{s=1}^m$ are the hypotheses employed by the kernel-based MPA on round t.
We give the proof in the Appendix. Although the resource perspective starts from a different viewpoint, Proposition 1 guarantees that the kernel-based MPA and the resource perspective considering the degrees of utilization of available resources employ the same sequence of m-class discriminants to make predictions. In other words, the resource perspective provides an alternative and equivalent interpretation of the kernel-based MPA. This brings us to the proposed BMPA algorithm, which approximates the kernel-based MPA when only a limited number of resources is available.

B. BUDGETED MULTICLASS PASSIVE-AGGRESSIVE (BMPA) ALGORITHM
As described in Section III-B, the kernel-based MPA suffers from the curse of kernelization. Fortunately, Proposition 1 suggests that we can approximate the kernel-based MPA based on the re-interpretation from the resource perspective. Therefore, based on the resource perspective, we propose the BMPA algorithm to limit the number of available resources used to construct the kernel-based hypotheses of an m-class discriminant by a predefined number B, called the budget. In practical applications, the budget B can be easily defined in advance by considering the available computing power. Note that a single budget B can be used to control the growth of the prediction model by simultaneously controlling the growth of all kernel-based hypotheses.
To be clearer about how to use the budget B with the resource perspective, we again assume the hypotheses on round t are some linear combinations of kernels centered on some SEs,

$f_t^s(\cdot) = \sum_{i \in I_t} \alpha_i^s\, k(x_i, \cdot), \quad s \in \mathcal{Y}$, (20)

where $I_t$ is the corresponding support index set. From now on we simplify the notation by dropping the symbol ∼. If on round t the hinge loss is zero or the number of SEs is less than the budget B, the proposed BMPA algorithm performs the update exactly as the kernel-based MPA does in (13a)-(13c). We call this kind of update the MPA update. Otherwise, we have to deal with the case in which the discriminant suffers nonzero hinge loss and the number of SEs reaches the budget, i.e., $|I_t| = B$. BMPA will remove one of the current SEs and set the remaining SEs and the current instance $x_t$ as available resources for prediction on the next round. In other words, BMPA seeks new hypotheses that are some linear combinations of kernels centered on the set of only B SEs,

$f_{t+1}^s(\cdot) = \sum_{i \in I_{t+1}} a_i^s\, k(x_i, \cdot), \quad s \in \mathcal{Y}$, (21)

where $I_{t+1} = (I_t \setminus \{r\}) \cup \{t\}$ corresponds to the B available SEs and $r \in I_t$ is the index of the removed SE. Then, BMPA determines the degrees of utilization for the next round by solving (17) with (20) and (21). We call this type of update the BMPA update and state it specifically in the following proposition.

Proposition 2: Assume all Gram matrices of encountered instances in the RKHS $\mathcal{H}$ of a kernel k are strictly positive definite. If on round t the discriminant suffers nonzero hinge loss and the budget is reached, $|I_t| = B$, BMPA updates the hypotheses as follows:

$f_{t+1}^s = f_t^s + (\delta_{s,y_t} - \delta_{s,\hat{s}_t})\,\tau_t\, k(x_t, \cdot) - \alpha_r^s\, k(x_r, \cdot) + \alpha_r^s\, P_t[k(x_r, \cdot)], \quad \forall s \in \mathcal{Y}$, (22a)

where $r \in I_t$,

$P_t[k(x_r, \cdot)] = \sum_{j \in I_{t+1}} \beta_j\, k(x_j, \cdot)$, (22b)
$\beta = K_{t+1}^{-1} \kappa_r$, (22c)

with $K_{t+1}$ the Gram matrix of the SEs indexed by $I_{t+1}$ and $\kappa_r = (k(x_j, x_r))_{j \in I_{t+1}}$.

The proof is given in the Appendix. It is worth noting that (22a) of the BMPA update can be decomposed into two parts,

$\bar{f}_{t+1}^s = f_t^s + (\delta_{s,y_t} - \delta_{s,\hat{s}_t})\,\tau_t\, k(x_t, \cdot)$, (23a)
$f_{t+1}^s = \bar{f}_{t+1}^s - \alpha_r^s\, k(x_r, \cdot) + P_t[\alpha_r^s\, k(x_r, \cdot)]$, (23b)

where $P_t[f]$ is the orthogonal projection of f onto the subspace $\mathrm{span}(\{k(x_j, \cdot) \mid j \in I_{t+1}\})$. The first part (23a) can be interpreted as an MPA update (cf. (13a)-(13c)), and the second part (23b) corresponds to the projection of the removed SE $x_r$. Although in this subsection we start from removing $x_r$ and then determine the degrees of utilization of the B available SEs, (23a)-(23b) suggest that the BMPA update can be interpreted in another way: the BMPA update first performs the MPA update and then removes the unaffordable resource $x_r$ by projection. The projection $P_t[\alpha_r^s k(x_r, \cdot)]$ preserves the information of $x_r$, so it minimizes the information loss in the prediction model when $x_r$ is removed. We summarize the proposed BMPA algorithm in Algorithm 1.
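The following sketch implements one BMPA update in the MPA-then-project form (23a)-(23b). For clarity it rebuilds the Gram matrix and solves a linear system directly, whereas the algorithm described in Section IV-C maintains the matrix inverse recursively; the function signature and the (B, m) array layout are our own choices.

```python
import numpy as np

def bmpa_update(SEs, alpha, x_t, y, s_hat, tau, r, kernel):
    """One BMPA update: MPA step on x_t, then removal of SE r by projecting
    k(x_r, .) onto span{k(x_j, .) : j in I_{t+1}}; `alpha` is a (B, m) array."""
    x_r, a_r = SEs[r], alpha[r]
    keep = [i for i in range(len(SEs)) if i != r]
    new_SEs = [SEs[i] for i in keep] + [x_t]     # I_{t+1} = (I_t \ {r}) U {t}
    a_t = np.zeros(alpha.shape[1])
    a_t[y], a_t[s_hat] = tau, -tau               # MPA part, cf. (23a)
    new_alpha = np.vstack([alpha[keep], a_t])
    # Orthogonal projection coefficients of k(x_r, .), cf. (22b)-(22c):
    K = np.array([[kernel(u, v) for v in new_SEs] for u in new_SEs])  # Gram
    kap = np.array([kernel(u, x_r) for u in new_SEs])
    beta = np.linalg.solve(K, kap)
    new_alpha += np.outer(beta, a_r)             # redistribute x_r's weights (23b)
    return new_SEs, new_alpha
```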

C. BUDGET MAINTENANCE
Although the proposed BMPA algorithm only requires that the index of the removed SE on round t be selected from the current support index set $I_t$, without any other specific rule, in practice the determination of the removed SE plays an important role. In this subsection, we study three removal strategies for budget maintenance.

Algorithm 1 The BMPA Algorithm
Input: budget B, kernel k
Initialize: $I_1 = \emptyset$ and $f_1^s = 0$ for all $s \in \mathcal{Y}$
for t = 1, 2, . . . do
    Receive the instance $x_t$ and compute the scores $f_t^s(x_t)$, $s \in \mathcal{Y}$
    Predict the class label $\hat{y}_t = \arg\max_{s \in \mathcal{Y}} f_t^s(x_t)$
    Receive the correct class label $y_t$
    Compute the hinge loss $\ell_t = \ell(\{f_t^s\}_{s=1}^m; (x_t, y_t))$
    if $\ell_t = 0$ then  ▷ MPA update
        $I_{t+1} = I_t$ and keep $f_{t+1}^s = f_t^s$ for all $s \in \mathcal{Y}$
    else if $|I_t| < B$ then  ▷ MPA update
        Update $f_{t+1}^s$ by (13a)-(13c)
    else  ▷ BMPA update
        Determine the removed index $r = r_t \in I_t$
        $I_{t+1} = (I_t \setminus \{r\}) \cup \{t\}$
        Update $f_{t+1}^s$ by (22a)-(22c)
    end if
end for

1) OLDEST REMOVAL (BMPA-O)
In online learning, older tasks may contain less information for making accurate predictions on future tasks. This means the oldest SE may contain the least information for future predictions. It suggests removing the oldest SE to maintain the budget, i.e., the SE whose index is the smallest element in $I_t$.
The main computational burden of BMPA is located at the case where a BMPA update is required. The time complexity of BMPA-O is dominated by the computation of the matrix inverse $K_{r_t}^{-1}$. Computing $K_{r_t}^{-1}$ directly from $K_{r_t}$ takes O(B³) time and is not efficient. Instead, we use a recursive approach to compute the matrix inverse by exploiting the decremental and incremental natures of the Gram matrix. We compute $K_{r_t}^{-1}$ from the previous matrix inverse $K_{r_{t-1}}^{-1}$ in O(B²) time by two recursive updates: the first shrinks the matrix size to (B − 1) × (B − 1), and the second enlarges it back to B × B. Each recursive update takes O(B²) time, and therefore the total time complexity of BMPA-O is reduced to O(B²). On the other hand, the space complexity of BMPA-O is O(B²) because we need to store the matrix inverse, which dominates the main memory cost.
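The two recursive updates are the standard block-inverse identities; below is a sketch under the assumption that the Gram matrix is symmetric and strictly positive definite.

```python
import numpy as np

def inverse_remove(M, r):
    """Given M = K^{-1}, return the inverse of K with row/column r removed
    (the decremental step), in O(B^2) time."""
    idx = [i for i in range(M.shape[0]) if i != r]
    E = M[np.ix_(idx, idx)]
    f = M[idx, r]
    g = M[r, r]
    return E - np.outer(f, f) / g

def inverse_append(A_inv, b, c):
    """Given A^{-1}, return the inverse of K enlarged by one row/column with
    off-diagonal kernel values b and diagonal value c (the incremental step)."""
    v = A_inv @ b
    s = c - b @ v                      # Schur complement; positive for a PD Gram
    top = A_inv + np.outer(v, v) / s
    return np.block([[top,             -v[:, None] / s],
                     [-v[None, :] / s, np.array([[1.0 / s]])]])
```

Calling `inverse_remove` and then `inverse_append` realizes the shrink-then-enlarge recursion, and both steps cost O(B²).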

2) PROJECTION REMOVAL (BMPA-P)
Since a BMPA update can be interpreted as an MPA update followed by a projection contributed by the removed SE $x_r$ (cf. (23a)-(23b)), a smaller total projection error leads to less information loss in the prediction model. Note that the magnitude of the total projection error on round t, for a removal candidate $q \in I_t$, has an analytic form,

$E_q = \sqrt{\big(k_{qq} - \kappa_q^\top K_q^{-1} \kappa_q\big) \sum_{s \in \mathcal{Y}} (\alpha_q^s)^2}$,

where $K_q$ is the Gram matrix of the SEs indexed by $(I_t \setminus \{q\}) \cup \{t\}$, $\kappa_q$ collects the kernel values between those SEs and $x_q$, and $k_{qq} = k(x_q, x_q)$. Hence, it suggests removing the SE with the least total projection error on each round,

$r_t = \arg\min_{q \in I_t} E_q$.

The time complexity of BMPA-P is dominated by the computation of the matrix inverse $K_q^{-1}$. Moreover, to determine $r_t$ it needs to compute $K_q^{-1}$ for every $q \in I_t$. Again, we exploit the decremental and incremental natures of the Gram matrix to compute the matrix inverse. Each $K_q^{-1}$ is computed from $K_{r_{t-1}}^{-1}$ in O(B²) time by the two recursive updates used by BMPA-O. Therefore, the total time complexity of BMPA-P is O(B³). On the other hand, the space complexity of BMPA-P is O(B²) because the storage of $K_{r_{t-1}}^{-1}$ and $K_q^{-1}$ dominates the main memory cost.

3) SMALLEST REMOVAL (BMPA-S)
If the weight associated with some SE is far from zero, e.g., $|\alpha_j^s| \gg 0$ for some class s, the score of class s may change dramatically upon removal of the associated SE $x_j$, and the prediction accuracy may degrade. If the weights associated with some SE are almost zero, e.g., $\alpha_j^s \approx 0$, $\forall s \in \mathcal{Y}$, we can safely remove the associated SE $x_j$ without changing the scores of all classes and the prediction accuracy too much. It implies that the weights associated with the SEs may carry the important information for future predictions. This suggests removing the SE with the smallest sum of squared weights,

$r_t = \arg\min_{q \in I_t} \sum_{s \in \mathcal{Y}} (\alpha_q^s)^2$.

It is worth noting that BMPA-P becomes BMPA-S if $k_{qq} - \kappa_q^\top K_q^{-1} \kappa_q$ is assumed to be a constant value for all choices of SEs in $I_t$. This means BMPA-S can be treated as an approximation of BMPA-P. Remark: These removal strategies are feasible plans to remove less important SEs and keep more important SEs for the proposed BMPA algorithm. These strategies are introduced based on the resource perspective proposed in this study.
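The three strategies differ only in how the removed index is chosen, as the following sketch shows; the argument `proj_factor`, holding the precomputed values $k_{qq} - \kappa_q^\top K_q^{-1} \kappa_q$ for each candidate, is an assumption of this sketch.

```python
import numpy as np

def select_removed_position(strategy, indices, alpha, proj_factor=None):
    """Sketch of the three removal strategies; `indices` lists the round
    numbers in I_t, and `alpha` is the (B, m) weight array in the same order.
    Returns the position (0..B-1) of the SE to remove."""
    w2 = np.sum(np.asarray(alpha) ** 2, axis=1)   # sum of squared weights per SE
    if strategy == "oldest":          # BMPA-O: smallest element of I_t
        return int(np.argmin(indices))
    if strategy == "projection":      # BMPA-P: least total projection error E_q
        return int(np.argmin(proj_factor * w2))
    if strategy == "smallest":        # BMPA-S: smallest sum of squared weights
        return int(np.argmin(w2))
    raise ValueError(f"unknown strategy: {strategy}")
```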

V. THEORETICAL ANALYSIS
In this section, we give a theoretical analysis of the proposed BMPA algorithm. Specifically, we provide a unified relative mistake bound that is applicable to any SE removal strategy as long as the removed SE is selected properly, i.e., $r_t \in I_t$, and then conduct an empirical study of the theoretical bound on all three removal strategies studied in Section IV-C.
Let us first restate the proposed BMPA algorithm. As long as the hinge loss is zero or the number of SEs is less than the budget B, BMPA performs the MPA update. The number of SEs therefore grows with every nonzero hinge loss and eventually reaches the budget B. Once the number of SEs reaches the budget B and the discriminant suffers nonzero hinge loss, BMPA first adds an SE by performing the MPA update, and then reduces the number of SEs to B by projecting the kernel centered on the removed SE onto the subspace spanned by the kernels centered on the new set of SEs, i.e., $\mathrm{span}(\{k(x_j, \cdot) \mid j \in I_{t+1}\})$.
Let the hypotheses of the m-class discriminant on round t be

$f_t^s(\cdot) = \sum_{i \in I_t} \alpha_{i,t}^s\, k(x_i, \cdot), \quad s \in \mathcal{Y}$,

where $I_t$ is the support index set and $\hat{s}_t = \arg\max_{s \in \mathcal{Y}, s \neq y_t} f_t^s(x_t)$ is the most misleading class. Note that the combination weights are denoted by $\alpha_{i,t}^s$'s instead of $\alpha_i^s$'s because their values may change when the hypotheses are updated. Denote by $\bar{I}_t$ the index set obtained on round t after applying the MPA update, that is, $\bar{I}_t = I_t \cup \{t\}$. Let $\{\bar{f}_t^s\}_{s=1}^m$ denote the corresponding set of hypotheses,

$\bar{f}_t^s(\cdot) = \sum_{i \in \bar{I}_t} \alpha_{i,t}^s\, k(x_i, \cdot)$,

where $\alpha_{t,t}^s = (\delta_{s,y_t} - \delta_{s,\hat{s}_t})\,\tau_t$ and $\tau_t = \ell_t / (2 k(x_t, x_t))$. Now, define $I_{t+1}$ to be $I_{t+1} = \bar{I}_t \setminus \{r_t\}$, where $r_t \in I_t$ is the index of the removed SE. Note that as long as $r_t$ is selected from $I_t$, its choice does not affect the analysis. Then, the corresponding hypotheses $\{f_{t+1}^s\}_{s=1}^m$ are

$f_{t+1}^s = P_t[\bar{f}_t^s], \quad s \in \mathcal{Y}$,

with the corresponding projection errors

$\epsilon_t^s = \bar{f}_t^s - P_t[\bar{f}_t^s], \quad s \in \mathcal{Y}$,

where $P_t[f]$ is the orthogonal projection of f onto the subspace $\mathrm{span}(\{k(x_j, \cdot) \mid j \in I_{t+1}\})$. Denote by T the number of rounds in the sequence and by J the set of rounds on which the discriminant suffers nonzero hinge loss, namely, $J = \{1 \leq t \leq T : \ell_t > 0\}$; the set of rounds on which a prediction mistake is made is a subset of J. Now, we are ready to present the relative mistake bound for BMPA. Theorem 1: Let $\mathcal{H}$ be the RKHS of a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ be a sequence of examples where $x_t \in \mathcal{X}$, $y_t \in \mathcal{Y} = \{1, 2, \ldots, m\}$, $m > 2$, and $k(x_t, x_t) = R^2$, $R > 0$, for all t. Then, for any competing m-class discriminant comprising m hypotheses $g^s \in \mathcal{H}$, $s \in \mathcal{Y}$, the number of prediction mistakes made by BMPA on this sequence is bounded above by (34). We give the proof in the Appendix. This bound is an upper bound measuring the prediction performance of BMPA relative to any competing m-class discriminant comprising m hypotheses selected from the RKHS $\mathcal{H}$ of a kernel k. As long as $r_t$ is selected from $I_t$, the analysis is applicable no matter what removal strategy is executed when the budget is full.
The bound mainly consists of three terms. The first term measures the size of the competing discriminant, the second term evaluates the prediction performance of the competing discriminant applied to the entire sequence, and the third term assesses the influence of the projection errors incurred by the proposed BMPA algorithm as well as the size of the competing discriminant. If either the budget is not reached, i.e., $|I_t| < B$, or there is no budget constraint at all, i.e., B = ∞, BMPA always performs the MPA updates. The projection errors $\epsilon_t^s$'s become zero, and the theorem states the same mistake bound as for the kernel-based MPA [10]. If the budget of BMPA is finite and reached, BMPA performs the BMPA update, which removes an SE, for every nonzero hinge loss. The effect of SE removal contributes to the projection errors in the third term of the mistake bound and thus makes the bound less tight.
Since the competing m-class discriminant can be chosen in hindsight from the RKHS of the kernel k, one typical choice to gain some insight is to choose the m fixed hypotheses that achieve the best prediction performance for the entire sequence of examples. In this case, the second term has the least influence on the bound and the effect of the first term diminishes gradually. Moreover, the third term involving projection errors dominates the mistake bound. This suggests that the removal strategy plays an important role for determining the prediction performance of BMPA.
It is worth noting that, although the theorem states a relative mistake bound, we can take one step back in the proof and obtain a relative loss bound for BMPA. The loss bound has the same form as the mistake bound, as suggested by the last line of the proof in the Appendix.
In the literature, there are a few online algorithms that break the curse of kernelization for multiclass problems and come with theoretical analyses. The Projectron++ algorithm enjoys a relative mistake bound controlled by a sparseness parameter η [13], while the MFOGD and MNOGD algorithms have regret bounds, which fall into a slightly different realm [22].
Remark 1: Examples of kernels satisfying the condition are the Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|_2^2)$ and the exponential kernel $k(x, x') = \exp(-\gamma \|x - x'\|_2)$, where $\gamma > 0$ and $x, x' \in \mathcal{X} \subseteq \mathbb{R}^d$ [38]. Note that R = 1 for both examples.
Remark 2: Intuitively, BMPA-P attempts to reduce the total projection errors in a greedy way on each round. As implied by the theoretical analysis, BMPA-P may achieve a tighter bound than BMPA-O and BMPA-S.

A. EMPIRICAL STUDY OF THE THEORETICAL BOUND
Because the mistake bound described in Theorem 1 is obtained by applying inequalities several times, it is by no means tight. However, we can still gain some insights through the following empirical study, in which only the usps dataset is used to illustrate the results; the other datasets show tendencies similar to those demonstrated with the usps dataset. Unless specifically mentioned, we follow the same experimental setting described in Section VI. The m competing hypotheses $\{g^s\}_{s=1}^m$ are chosen such that they minimize the cumulative squared hinge loss of the sequence of T examples. FIGURE 2 demonstrates the results of the three removal strategies described in Section IV-C as well as the kernel-based MPA algorithm, abbreviated as MPA. We set B = 100 for FIGURE 2(a)-2(c). FIGURE 2(a) depicts the cumulative mistake rate and FIGURE 2(b) shows the rate of the mistake bound, which is the mistake bound divided by the number of examples. We also plot the curves of FIGURE 2(a) in FIGURE 2(b): they cannot be distinguished visually and are very close to the horizontal axis. Although FIGURE 2(b) shows that the mistake bounds are not tight at all, the bounds indeed exhibit the same ordering as the cumulative mistake rates: BMPA-O > BMPA-P ≈ BMPA-S > MPA. To see the interaction of the three terms in the mistake bound, we compare in FIGURE 2(c) the rate of each individual term, defined after round t as the corresponding square-root term divided by the square root of the number of examples, where $J_t = \{t' \leq t : \ell(\{f_{t'}^s\}_{s=1}^m; (x_{t'}, y_{t'})) > 0\}$ records the rounds with nonzero hinge loss so far. FIGURE 2(c) shows that the first term diminishes gradually, the second term has the least influence, and the third term dominates the bound. This confirms our previous claim and suggests that the removal strategy is important in determining the prediction performance. Lastly, FIGURE 2(d) verifies that the relationship remains across different budgets (cf. FIGURE 4(e)).
Remark: Intuitively, if we can make the mistake bound tighter, the number of prediction mistakes may be smaller. Since a BMPA update consists of an MPA update and a projection contributed by the removed SE, the corresponding projection error contributes to the third term of the mistake bound and thus makes the bound less tight. This suggests finding a way to minimize the total projection error term in the bound; however, doing so directly contradicts the nature of online multiclass classification, because we cannot anticipate future tasks when we deal with the current task. Therefore, BMPA-P serves as a workaround that makes the bound tighter by greedily minimizing the mistake bound so far. The results in Section VI-B show that BMPA-P achieves the best prediction performance among the three removal strategies studied in this paper. However, in practical applications we may also take the execution time into account, which suggests that BMPA-S is better suited for practical applications (cf. FIGURE 5).

VI. EXPERIMENTS
We conduct experiments to demonstrate the effectiveness of the proposed BMPA algorithm. Overall, there are three goals to achieve through the experimental results. Firstly, we validate that BMPA indeed approximates the kernel-based MPA algorithm, which is abbreviated as MPA in this section. Secondly, the different removal strategies of BMPA are compared. Finally, we show that BMPA is competitive with state-of-the-art budgeted online algorithms. For all kernel-based algorithms, we adopt the Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|_2^2)$, where the parameter γ is selected from $\{2^{-10}, 2^{-8}, \ldots, 2^{10}\}$. To have a fair comparison, we use the multiclass version of the Perceptron algorithm with the max-score update (abbreviated as MPerceptron) [52] to determine the value of γ. The determination rule is similar to that used in [13]. Specifically, for each dataset, the parameter γ is selected to yield the smallest average cumulative mistake rate with MPerceptron and is then used for all the other algorithms. All algorithms are implemented in MATLAB and simulated with MATLAB R2017a on a personal computer equipped with 16GB RAM and an Intel Core i7-4790 CPU at 3.6GHz. The operating system is Windows 7 Professional 64-bit. For the elapsed time measured in Section VI-B, we enable single-thread processing.
Remark: The kernel-based PA algorithm suffers from the curse of kernelization and may fail to work when t is very large; this is remedied by BPA, as demonstrated with a large-scale dataset of one million examples in the experiments section of [20]. The kernel-based MPA algorithm, which is the multiclass extension of the kernel-based PA algorithm, also suffers from the curse of kernelization and is fixed by the proposed BMPA algorithm. Thus, BMPA can be treated as the multiclass extension of BPA, with the same space complexity as BPA and m times its time complexity. Both the time and space complexities are constant once the budget B is set. Therefore, BMPA is feasible for large-scale datasets.

A. APPROXIMATION OF MPA
We use the oldest removal strategy to verify the approximation; the observations also apply to the other strategies. The mistake rate of MPA drops as the number of examples increases and is almost always the lowest compared with BMPA-O. The mistake rate of BMPA-O at the beginning is the same as that of MPA, since the number of SEs at that time is smaller than the predefined budget. When the number of SEs later reaches the budget, the drop in the mistake rate becomes slower. Moreover, the curve of BMPA-O with a larger budget is lower and much closer to that of MPA. Eventually, BMPA-O and MPA become indistinguishable if the budget is large enough. We also plot the mistake rate versus the budget for BMPA-O in FIGURE 4. It is evident that the mistake rate converges gradually to that of MPA as the budget increases. We conclude that BMPA is an approximation of MPA.

B. REMOVAL STRATEGIES
Because the removal strategy plays a central role in practice, here we compare the performance of the three proposed strategies. FIGURE 4 depicts the cumulative mistake rate as a function of the budget B for MPA and BMPA. Since there is no SE removal in MPA, its mistake rate is plotted as a horizontal dash-dotted line. It is clear that the mistake rate for all strategies converges gradually, at a different pace, to that of MPA as the budget increases. This result supports our claim that BMPA is an approximation of MPA. As expected from the relative mistake bound, BMPA-P has almost the lowest mistake rate among three proposed strategies. Moreover, BMPA-S has a slightly worse or similar performance as compared with BMPA-P. This somehow justifies that BMPA-S approximates BMPA-P.
We also plot in FIGURE 5 the elapsed time as a function of the mistake rate for MPA and BMPA. FIGURE 5 shows that all three removal strategies trade mistake rate for elapsed time. Among them, BMPA-S achieves the best tradeoff, and the gap in elapsed time widens at lower mistake rates. Furthermore, BMPA runs more efficiently on sequences with a large number of examples and high dimensions, e.g., protein and mnist: their BMPA curves are mostly located in the bottom-right region relative to MPA.

C. COMPETITIVENESS
We take BMPA-S, which achieves the best tradeoff in Section VI-B, as the representative of BMPA and compare it with the following state-of-the-art budgeted online algorithms:
• MRBP: the multiclass randomized budget Perceptron algorithm with the random removal strategy [13], [19];
• MProjectron++: the multiclass Projectron++ algorithm using the projection strategy [13];
• MFOGD: the multiclass Fourier online gradient descent algorithm using random Fourier features [22];
• MNOGD: the multiclass Nyström online gradient descent algorithm using Nyström-based features [22].
The above algorithms are essentially designed for online multiclass classification. MRBP represents the multiclass extension of the randomized budget Perceptron algorithm in the natural manner [13], [19]. In particular, the max-score update is used. MRBP discards an SE at random from the current set of SEs when the number of SEs reaches the budget and includes the current example as a new SE. Since the discarding of MRBP involves a randomized mechanism, we run it ten times and take the average for each sequence.
To have a fairer comparison with MRBP, we also implement BMPA with random removal (abbreviated as BMPA-R), which selects an SE to remove at random. MProjectron++ is the multiclass extension of the Projectron++ algorithm [13]. MProjectron++ adopts a projection approach to bound the number of SEs; basically, it projects the kernel centered on the current instance onto the subspace spanned by kernels centered on SEs under proper conditions involving a sparseness parameter η. If the number of SEs reaches the predefined budget, the projection is always executed regardless of the conditions. For each budget, η is selected from {0.2, 0.4, . . . , 1.4} and is set to the value attaining the lowest mistake rate. MFOGD approximates shift-invariant kernels by using random Fourier features and learns the corresponding linear hypotheses by the online gradient descent algorithm [22]. MNOGD approximates the kernel matrix by using the Nyström method and learns the corresponding linear hypotheses by the online gradient descent algorithm [22]. Following the parameter recommendation in [22], we set the number of Fourier components to 6B for MFOGD and the rank approximation parameter to 0.4B for MNOGD. The gradient descent step size η for both algorithms is drawn from {2, 0.2, . . . , 0.0002}. MFOGD is run ten times for each sequence due to the random features. We also include MPerceptron and MPA for comparison; both algorithms make use of the kernel trick and belong to the non-budgeted methods.

FIGURE 6. The cumulative mistake rate versus the budget for budgeted and non-budgeted online multiclass algorithms. The two dotted horizontal lines corresponding to MPerceptron and MPA respectively are provided as baselines for comparison.
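For reference, the random Fourier feature map underlying MFOGD-style methods can be sketched as below for the Gaussian kernel used in our experiments; this is the standard construction, not necessarily the exact implementation of [22].

```python
import numpy as np

def make_rff(d, D, gamma, seed=0):
    """Returns z : R^d -> R^D with z(x).z(x') ~ exp(-gamma * ||x - x'||^2).
    W and b must be drawn once and reused for every instance in the stream."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, D))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)               # random phases
    return lambda x: np.sqrt(2.0 / D) * np.cos(x @ W + b)
```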
TABLE 3 summarizes the properties of these budgeted online algorithms. The second column indicates the algorithm prototype they adopt, and the third column records the form of the employed hypotheses. Among the last three columns, which are related to SEs, the first indicates the adoption of a removal strategy, the second records how the removed SE is selected, and the third describes how the information of the removed SE is preserved. It is worth noting that neither MFOGD nor MNOGD removes any SE; both transform every SE into kernel-induced feature vectors. FIGURE 6 depicts the cumulative mistake rate as a function of the budget B for budgeted online algorithms as well as non-budgeted methods. From the curves, we can draw the following observations. First of all, MPA outperforms MPerceptron, since the mistake rate is lower for MPA than for MPerceptron. Secondly, the mistake rate of all budgeted algorithms drops as the budget increases; BMPA converges to MPA, while a performance gap remains between the other budgeted algorithms and MPA. Thirdly, BMPA generally outperforms the other budgeted algorithms, since it achieves the lowest mistake rate for each dataset and each budget. Lastly, the comparison of MRBP, BMPA-R, and BMPA-S suggests that both the removal strategy and the non-budgeted algorithm itself are important for good prediction performance. All in all, we conclude that BMPA is competitive with state-of-the-art budgeted online algorithms.
Finally, as a summary, TABLE 4 compares the time and space complexities per update for budgeted and non-budgeted online multiclass algorithms evaluated in this section. Because MPerceptron and MPA are non-budgeted algorithms, their time and space complexities depend on the number of SEs, |I t |, and grow gradually. On the other hand, budgeted algorithms, including the proposed BMPA algorithm, have constant time and space complexities once the budget B is given. Although none of the removal strategies associated with the proposed BMPA algorithm shows benefits based on the complexity comparison, BMPA-S achieves the best prediction performance in general according to the results demonstrated in FIGURE 6.

VII. CONCLUSION
Based on the kernel-based MPA algorithm, we propose the BMPA algorithm, which controls the growth of a kernel-based model by limiting the maximum number of available support elements via removal and fully exploits them through a constrained optimization. BMPA is derived from the resource perspective, which provides an alternative and equivalent interpretation of the kernel-based MPA algorithm, and can be treated as an approximation to MPA justified by that perspective. Moreover, it breaks the curse of kernelization, which makes kernel-based models fail on resource-insufficient occasions. We study three removal strategies along with a unified theoretical analysis of the prediction mistakes made by BMPA; among them, BMPA-S achieves the best tradeoff. We conduct simulation experiments on open datasets and show that BMPA is effective and competitive. In future work, we plan to study the combination of the resource perspective and the budget idea with more online algorithms, particularly those based on optimization. Besides, we will study the integration of the proposed BMPA algorithm with other kernels and the extension of the proposed BMPA algorithm to scenarios with dynamic computational power.
Note that the proposed BMPA algorithm is designed to deal with online problems where the sequences of examples are generated without any statistical assumption. There is another research thread dealing with sequences of examples generated from a fixed source (or a source model with specific parameters), which is the primary focus of batch learning. Although the proposed BMPA algorithm may be used for this kind of problem, it is not the main focus of this paper, and we leave it as one of the future works. On the other hand, it has come to our attention that some of the most representative computational intelligence algorithms are bio-inspired, e.g., monarch butterfly optimization [56], elephant herding optimization [57], the earthworm optimisation algorithm [58], and the moth search algorithm [59]. These types of algorithms may be used to design powerful online algorithms and are worth further study.
HENRY HORNG-SHING LU received the B.S. degree in electrical engineering from National Taiwan University, in 1986, and the Ph.D. degree in statistics from Cornell University, in 1994. He is currently a Professor with the Institute of Statistics, National Chiao Tung University, Taiwan, where he serves as the Vice President for Academic Affairs. His findings have been published in a wide spectrum of journals, conference papers, and book chapters. He also co-edited the Handbook of Statistical Bioinformatics and Big Data Analytics (Springer, 2011 and 2018). His research interests include statistics, image science, bioinformatics, and big data analytics. He analyzes different types of data by developing statistical methodologies for machine learning with the power of statistical inference and computational algorithms. He is an Elected Member of the International Statistical Institute (ISI).