On the Treatment of Optimization Problems with L1 Penalty Terms via Multiobjective Continuation

We present a novel algorithm that allows us to gain detailed insight into the effects of sparsity in linear and nonlinear optimization, which is of great importance in many scientific areas such as image and signal processing, medical imaging, compressed sensing, and machine learning (e.g., for the training of neural networks). Sparsity is an important feature to ensure robustness against noisy data, but also to find models that are interpretable and easy to analyze due to the small number of relevant terms. It is common practice to enforce sparsity by adding the $\ell_1$-norm as a weighted penalty term. In order to gain a better understanding and to allow for an informed model selection, we directly solve the corresponding multiobjective optimization problem (MOP) that arises when we minimize the main objective and the $\ell_1$-norm simultaneously. As this MOP is in general non-convex for nonlinear objectives, the weighting method will fail to provide all optimal compromises. To avoid this issue, we present a continuation method which is specifically tailored to MOPs with two objective functions, one of which is the $\ell_1$-norm. Our method can be seen as a generalization of well-known homotopy methods for linear regression problems to the nonlinear case. Several numerical examples, including neural network training, demonstrate our theoretical findings and the additional insight that can be gained by this multiobjective approach.


INTRODUCTION
Optimization problems are at the center of a very large number of problems in science and engineering. In many situations, the task of minimizing some specified objective function is further complicated by the desire to enforce sparsity, i.e., to obtain a solution with as few non-zero entries as possible, which has several important advantages. Most notably, sparse solutions tend to be robust against noise. Furthermore, they often allow for a better analysis and interpretation due to the small number of relevant terms, for instance in the identification of governing equations [1]. Practical applications can be found in many different areas, e.g., signal processing, compressed sensing, neural network training and medical imaging [2], [3], [4], [5], [6].
In signal and image processing, one frequently considers (overdetermined) linear regression models with an $\ell_1$ penalty term:
$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 + \lambda \|x\|_1, \qquad (1)$$
where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $\lambda \in \mathbb{R}_{\geq 0}$. This problem is often referred to as Lasso [7] or basis pursuit [8]. Since appropriately weighting the penalty term is difficult a priori, the solution is computed for multiple values of $\lambda$, for instance in homotopy methods [9], [10], [11], [12], where the entire solution path for $\lambda \in \mathbb{R}_{\geq 0}$ is tracked. The method is based on the subdifferential of the $\ell_1$-norm and the resulting Karush-Kuhn-Tucker (KKT) conditions, while exploiting the fact that the structure of zero entries along the solution path is locally constant almost everywhere. In recent years, nonlinear models have become increasingly popular, in particular due to the advances in deep learning and the associated neural network training, where sparsity can help to avoid overfitting [13]. As in the linear case, the use of $\ell_1$ penalty terms is a natural and common choice for obtaining sparse models, resulting in the problem
$$\min_{x \in \mathbb{R}^n} f(x) + \lambda \|x\|_1 \quad \text{with } \lambda \in \mathbb{R}_{\geq 0}. \qquad (2)$$
Since $f$ is based on a nonlinear model and is generally non-convex, (2) is much more difficult to solve than (1). In contrast to the classical $\ell_1$ penalty approach, we will here interpret sparse optimization as a (non-smooth) multiobjective optimization problem (MOP), where the objectives $f$ and $\|\cdot\|_1$ are minimized simultaneously, i.e.,
$$\min_{x \in \mathbb{R}^n} \begin{pmatrix} f(x) \\ \|x\|_1 \end{pmatrix}. \qquad (\text{MOP-}\ell_1)$$
In this context, (2) is equivalent to the weighted-sum approach, i.e., $\min_{x \in \mathbb{R}^n} \alpha_1 f(x) + \alpha_2 \|x\|_1$ with $\alpha_1, \alpha_2 \in [0, 1]$, $\alpha_1 + \alpha_2 = 1$ [14]. It is well known that every solution of the weighted-sum approach is a solution of the MOP [14]. As shown in Figure 1, the weight vector $\alpha$ is orthogonal to the Pareto front. Varying $\alpha$ thus yields different compromise solutions.
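As a concrete illustration of the Lasso problem (1) and of the effect of the penalty weight, the following sketch solves it by proximal gradient descent (ISTA); the data, function names, and parameter choices are illustrative and not taken from the paper:

```python
import numpy as np

def ista_lasso(A, b, lam, steps=5000):
    """Solve min_x ||Ax - b||^2 / 2 + lam * ||x||_1 via proximal gradient (ISTA)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        g = A.T @ (A @ x - b)              # gradient of the least-squares part
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0])
b = A @ x_true

# A larger penalty weight shrinks the solution toward sparsity;
# lam = 0 recovers the plain least-squares solution.
sparse = ista_lasso(A, b, lam=5.0)
dense = ista_lasso(A, b, lam=0.0)
print(np.count_nonzero(np.abs(sparse) > 1e-8), np.count_nonzero(np.abs(dense) > 1e-8))
```

Tracking this solution over many values of the penalty weight is exactly the solution path that homotopy methods compute more efficiently.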
However, if $f$ is non-convex, then $\alpha$ no longer uniquely defines a single Pareto optimal point, and we cannot find points in the non-convex parts of the Pareto front; see [14], [15] for in-depth discussions. This is also relevant in practice, e.g., in neural network training. For non-convex loss functions, well-generalizing sparse solutions may not be computable via the penalty approach. This is demonstrated in Section 4.3.
Alternative solution approaches suitable for non-convex problems are evolutionary algorithms [16], [17], scalarization methods [14], as well as continuation methods [18], [19], [20]. The latter require sufficiently smooth objective functions that ensure the existence of a tangent space, as the idea is to compute a discretization of the Pareto critical set by first performing a so-called predictor step along the tangent space and then computing the nearest Pareto critical point using an optimization procedure (corrector step).
In this paper, we present a new continuation method for non-smooth MOPs with two objectives, where the first objective is twice continuously differentiable and the second objective is the $\ell_1$-norm (Section 3). In particular, the first objective may be non-convex. Although the solution set of non-smooth MOPs does not generally possess the smooth structure required for classical continuation, we will show that this specific problem class is piecewise smooth. Thus, the solution set can be computed via classical continuation up to "kinks". We will show how these kinks can be detected and addressed within the continuation by exploiting the special structure of the optimality conditions specifically derived for this purpose. The global minimal point $0 \in \mathbb{R}^n$ of the $\ell_1$-norm can be used as the starting point.
Compared to scalarization methods (like the Adam descent method [21] applied to (2)), our approach has the advantage that we compute an approximation of the entire Pareto set instead of just a single point. In particular, we do not have to choose a penalty parameter. Compared to evolutionary methods (which also approximate the entire Pareto set), our approximation is generally closer to the true Pareto set while requiring fewer function evaluations. Furthermore, our method detects structural properties like kinks or the connectedness of the Pareto set, which remain unnoticed in (standard) evolutionary methods. On the other hand, some of the disadvantages and challenges of our method will be discussed in Section 5.
Beyond that, alternative approaches at the intersection of sparse and multiobjective optimization are, for instance, the training of (linear) SVM models using $\varepsilon$-constraint scalarization [22], [23], or the solution of MOPs with linear models and the $\ell_1$-norm via evolutionary algorithms [24]. In the context of sparse optimization, generalizations (for convex problem classes) of the classical homotopy methods were introduced in [25], [26], [27], [28], for instance to arbitrary convex loss functions or convex penalty terms. Thus, the main advantage of our method is the treatment of nonlinear, non-convex problems, which will be demonstrated using several examples of varying complexity (Section 4).
Due to the loss of a total order on $\mathbb{R}^k$ for $k > 1$, the solution of (MOP) is not a specific point, but the set of non-dominated points, also referred to as the Pareto set. Non-dominated points $x^*$ are characterized by the fact that there exists no $x$ with $F_i(x) \leq F_i(x^*)$ for all $i \in \{1, \ldots, k\}$ and $F_j(x) < F_j(x^*)$ for at least one $j \in \{1, \ldots, k\}$. The image of the Pareto set under $F$ is known as the Pareto front, lying in the objective space $\mathbb{R}^k$.
If all objectives are continuously differentiable, then these points satisfy the well-known necessary KKT conditions, which state that if a point $x^*$ is locally Pareto optimal, there exists a convex combination of the individual gradients that is zero [18], i.e., there exists $\alpha^* \in (\mathbb{R}_{\geq 0})^k$ with $\sum_{i=1}^k \alpha^*_i = 1$ and $\sum_{i=1}^k \alpha^*_i \nabla F_i(x^*) = 0$. Points satisfying the KKT conditions are denoted as Pareto critical, and the set containing all Pareto critical points is the Pareto critical set, which is a superset of the Pareto set.
As the $\ell_1$-norm does not satisfy the required smoothness assumptions for the classical KKT conditions, we need to consider subgradients instead. Definition 2.1 (Subdifferential and subgradient). Let $f : \mathbb{R}^n \to \mathbb{R}$ be locally Lipschitz continuous, and let $\Omega \subset \mathbb{R}^n$ be the set of points where $f$ is not differentiable. Then
$$\partial f(x) := \mathrm{Conv}\left(\left\{ g \in \mathbb{R}^n \,\middle|\, \exists (x_i)_{i \in \mathbb{N}} \subset \mathbb{R}^n \setminus \Omega \text{ with } x_i \to x \text{ and } \nabla f(x_i) \to g \right\}\right)$$
is the (Clarke) subdifferential of $f$ in $x$, where $\mathrm{Conv}$ denotes the convex hull. An element $g \in \partial f(x)$ is called a subgradient.
If $f$ is continuously differentiable in $x$, then the subdifferential is the set containing only the gradient, i.e., $\partial f(x) = \{\nabla f(x)\}$. In the case of the $\ell_1$-norm, the subgradients take the following form.

Example 2.2 (Subgradients and subdifferential of the $\ell_1$-norm).
To calculate $\partial f_1(x)$ for the $\ell_1$-norm $f_1 = \|\cdot\|_1$ at an arbitrary $x \in \mathbb{R}^n$, we first distinguish between active and inactive indices, with $A(x) := \{j \in \{1, \ldots, n\} \mid x_j \neq 0\}$ being the active set. It is easy to see that
$$(\partial f_1(x))_j = \begin{cases} \{\mathrm{sgn}(x_j)\} & \text{if } j \in A(x), \\ [-1, 1] & \text{otherwise.} \end{cases}$$
We denote the number of active indices by $n_A(x) := |A(x)|$ and the number of inactive indices by $n_0(x) := n - n_A(x)$. The subdifferential is now the convex hull of $M(x) := 2^{n_0(x)}$ vectors $g^i$ containing the extreme points of the subdifferential: each $g^i$ equals $\mathrm{sgn}(x_j)$ in the active entries and one of the $2^{n_0(x)}$ combinations of $\pm 1$ in the inactive entries. Consider the point $\bar{x} = (2, 0, -3, 0)^\top \in \mathbb{R}^4$ for example: here $A(\bar{x}) = \{1, 3\}$, and $\partial f_1(\bar{x})$ is the convex hull of the $M(\bar{x}) = 4$ vectors $(1, \pm 1, -1, \pm 1)^\top$.
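The enumeration of these extreme points can be sketched in a few lines; the helper below is hypothetical and only illustrates the construction from Example 2.2:

```python
import itertools
import numpy as np

def subdiff_extreme_points(x, tol=1e-12):
    """Extreme points of the subdifferential of the l1-norm at x.

    Active entries (x_j != 0) contribute sgn(x_j); every inactive entry
    ranges over {-1, +1}, giving 2**n_0 extreme points in total."""
    x = np.asarray(x, dtype=float)
    active = np.abs(x) > tol
    inactive = np.flatnonzero(~active)
    base = np.sign(x)                      # sgn(x_j) on the active entries, 0 elsewhere
    points = []
    for signs in itertools.product((-1.0, 1.0), repeat=len(inactive)):
        g = base.copy()
        g[inactive] = signs                # one sign combination per inactive entry
        points.append(g)
    return points

pts = subdiff_extreme_points([2.0, 0.0, -3.0, 0.0])
print(len(pts))   # two inactive indices give 2**2 = 4 extreme points
```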
Analogous to the smooth case, necessary optimality conditions for MOPs containing non-smooth objective functions can be obtained [31].

Proposition 2.3 (Non-smooth KKT conditions).
Assume that $F : \mathbb{R}^n \to \mathbb{R}^k$ is locally Lipschitz continuous and $x^*$ is locally Pareto optimal. Then there exists $\alpha \in (\mathbb{R}_{\geq 0})^k$ with $\sum_{i=1}^k \alpha_i = 1$ such that
$$0 \in \left\{ \sum_{i=1}^k \alpha_i g^i \,\middle|\, g^i \in \partial F_i(x^*) \right\}. \qquad (3)$$
Similar to the smooth case, we call $x \in \mathbb{R}^n$ Pareto critical if it satisfies the KKT conditions (3), and we call the set of all those points the Pareto critical set $P_c$.

THE CONTINUATION METHOD
Our goal is to compute the entire Pareto critical set $P_c$ of (MOP-$\ell_1$) using a continuation method that exploits the structure of the $\ell_1$-norm and its subgradients. As a starting point we can use $x = 0$, which is always Pareto critical as the minimum of the $\ell_1$-norm, and then move along $P_c$ in the direction where the value of $f$ decreases and $\|x\|_1$ increases.

Algorithm 1 PC algorithm
Input: $f : \mathbb{R}^n \to \mathbb{R} \in C^2$, $x_{\mathrm{start}} \in \mathbb{R}^n$ satisfying (3)
Output: Discretization $P_{\mathrm{PC}}$ of the Pareto critical set
$P_{\mathrm{PC}} = \emptyset$, $x_0 = x_{\mathrm{start}}$
while end of $P_c$ component has not been reached do
  $P_{\mathrm{PC}} = P_{\mathrm{PC}} \cup \{x_0\}$
  if $P_c$ is locally smooth then
    Predictor step tangential to $P_c$: $x_p \leftarrow \mathrm{predictor}(x_0)$
    Corrector step with $x_c \in P_c$: $x_c \leftarrow \mathrm{corrector}(x_p)$
  else
    determine all directions in which $P_c$ continues and proceed in each (Section 3.4)
  end if
  $x_0 = x_c$
end while

Continuation methods (or predictor-corrector (PC) methods) yield a sequence of Pareto critical points by, starting from a Pareto critical point, performing a prediction step along the tangent space of $P_c$ and then a correction step which results in the next Pareto critical point close by; see [18] for details. For the non-smooth problem (MOP-$\ell_1$), the Pareto critical set may contain kinks, such that we cannot exclusively rely on the tangent space in the prediction step. Thus, when we reach such a kink, we additionally require a mechanism that computes all possible directions in which the Pareto critical set might continue; see Algorithm 1 for a sketch. To this end, we will first have a more detailed look at the KKT conditions (3) for the special case of problem (MOP-$\ell_1$) in Section 3.1, before giving a detailed description of the predictor and corrector steps in Sections 3.2 and 3.3. We then discuss strategies for the (de-)activation of indices (i.e., how to overcome kinks) in Section 3.4, and subsequently address implementation and globalization in Section 3.5.
Since α 1 > 0, this is equivalent to Therefore, we have Since this has to hold for every active j, it follows Now assume that j is inactive. Define From (4) we obtain For the opposite direction "⇐", assume that x satisfies 2(a) and (b). Note that 2(a) implies sgn(x a )(∇f (x)) a < 0 and thus, Define

By 2(a)(i) we obtain
With 2(a)(i) it follows that for all active $j$ In particular, Analogously, for all inactive $j$, we can use 2(b) to obtain As a result, the vector $-\frac{\alpha_1}{1-\alpha_1} \nabla f(x)$ can be written as a convex combination of the vectors $g^i$ defined in Example 2.2. Let $\beta_1, \ldots, \beta_M$ be the corresponding coefficients and define showing that (4) holds.
Theorem 3.1 states that the absolute values of all gradient entries belonging to active indices are identical. At the same time, the absolute values of gradient entries belonging to inactive indices must always be less than or equal to the absolute value of active gradient entries. Furthermore, the proof of the theorem yields an explicit formula for the KKT multipliers of Pareto critical points.
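This characterization is easy to check numerically. The sketch below is a hypothetical helper; the tolerance handling and the sign condition $\mathrm{sgn}(x_a)(\nabla f(x))_a < 0$ (taken from the proof above) are assumptions of this example:

```python
import numpy as np

def is_pareto_critical(x, grad, tol=1e-9):
    """Numerical check of the conditions summarized after Theorem 3.1:
    all active gradient entries share one absolute value c, active entries
    point 'downhill' (sgn(x_a) * grad_a < 0), and inactive gradient
    entries satisfy |grad_j| <= c."""
    x, grad = np.asarray(x, float), np.asarray(grad, float)
    active = np.abs(x) > tol
    if not active.any():                   # x = 0 is always Pareto critical
        return True
    c = np.abs(grad[active])
    if not np.allclose(c, c[0], atol=tol):
        return False
    if np.any(np.sign(x[active]) * grad[active] >= 0):
        return False
    return bool(np.all(np.abs(grad[~active]) <= c[0] + tol))
```

For instance, `is_pareto_critical([1.0, 0.0], [-2.0, 1.0])` holds (active entry with $|\nabla f| = 2$, inactive with $|\nabla f| = 1 \leq 2$), while `is_pareto_critical([1.0, 1.0], [-2.0, -1.0])` fails because the active absolute values differ.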

Remark 3.2.
If $x$ is Pareto critical for (MOP-$\ell_1$) and $a$ is an active index, then KKT multipliers corresponding to $x$ are given by
$$\alpha_1 = \frac{1}{1 + |(\nabla f(x))_a|}, \qquad \alpha_2 = \frac{|(\nabla f(x))_a|}{1 + |(\nabla f(x))_a|}.$$
If $\nabla f(x) \neq 0$ then the KKT multipliers are unique. In particular, this indicates that "kinks" in the Pareto front can only occur if $\nabla f(x) = 0$.
As mentioned before, since (MOP-$\ell_1$) is a non-smooth problem, the Pareto critical set will generally contain points in which the tangent space does not exist. Theorem 3.1 can be used to characterize those points and to gain insight into the structure of $P_c$ by discriminating between equality and strict inequality in condition 2(b). Corollary 3.3. Let $x_0 \neq 0$ be a Pareto critical point such that the inequality in 2(b) of Theorem 3.1 is strict. Let $j_1 < \ldots < j_{n_A(x_0)}$ be the (sorted) active indices at $x_0$. Define the following two mappings that lift the active components to the $n$-dimensional point $x$ and vice versa:
$$h_1 : \mathbb{R}^{n_A(x_0)} \to \mathbb{R}^n, \quad (h_1(y))_{j_i} = y_i \text{ and } (h_1(y))_j = 0 \text{ for } j \notin A(x_0),$$
$$h_2 : \mathbb{R}^n \to \mathbb{R}^{n_A(x_0)}, \quad (h_2(x))_i = x_{j_i}.$$
Then, locally around $x_0$, the Pareto critical set of (MOP-$\ell_1$) coincides with the image under $h_1$ of the Pareto critical set of the smooth problem
$$\min_{y \in \mathbb{R}^{n_A(x_0)}} \begin{pmatrix} f(h_1(y)) \\ \|y\|_1 \end{pmatrix}. \qquad (5)$$
Proof. See Section B in the supplementary material.
The previous corollary implies that if the inequality in 2(b) of Theorem 3.1 is strict for a point $x_0 \in P_c$, then locally around $x_0$, the Pareto critical set of (MOP-$\ell_1$) coincides with the Pareto critical set of the MOP (5), embedded from $\mathbb{R}^{n_A(x_0)}$ into $\mathbb{R}^n$ via $h_1$. The objectives of (5) are smooth around $h_2(x_0)$ (since all entries of $h_2(x_0)$ are non-zero by construction), such that we can apply the theory for smooth MOPs to see that the tangent space of the Pareto critical set $P_c$ must exist at $x_0$. Thus, kinks in $P_c$ can only arise at points where equality holds in 2(b) for some $j \notin A(x_0)$. These are precisely the points where the active set can potentially change, i.e., where an inactive index might be activated (or vice versa). Corollary 3.3 shows that it is sufficient to consider the MOP (5) if we are at a point where the inequality in 2(b) of Theorem 3.1 is strict. Based on (5), this allows us to construct the predictor (Section 3.2) and corrector (Section 3.3) as in the smooth case, except at kinks.

Predictor
In the predictor step, we perform a step along the tangent space of the Pareto critical set from the starting point $x_0 \in P_c$. As discussed above, we assume that an open set around $x_0$ exists in which the zero structure of Pareto critical points does not change, such that the original MOP locally reduces to the problem (5), where both objective functions are at least twice continuously differentiable since the $\ell_1$-norm is applied only to non-zero entries. This means that we can compute the predictor according to classical continuation methods, cf. [18]. More precisely, let
$$H : \mathbb{R}^{n_A(x_0)} \times \mathbb{R}^2 \to \mathbb{R}^{n_A(x_0)+1}, \quad H(y, \alpha) = \begin{pmatrix} \alpha_1 \nabla f(h_1(y))_{|A(x_0)} + \alpha_2 s \\ \alpha_1 + \alpha_2 - 1 \end{pmatrix}$$
be the map containing the KKT conditions of (5). Then the tangent space of the Pareto critical set is given by the first $n_A(x_0)$ components of the kernel of the Jacobian
$$H'(y, \alpha) = \begin{pmatrix} \alpha_1 (\nabla^2 f(x))_{|A(x_0)} & \nabla f(x)_{|A(x_0)} & s \\ 0 & 1 & 1 \end{pmatrix}, \qquad (6)$$
where $(\nabla^2 f(x))_{|A(x_0)}$ and $\nabla f(x)_{|A(x_0)}$ denote the Hessian and the gradient of $f$ reduced to the active indices, respectively, and $s = \mathrm{sgn}(h_2(x_0))$ is the sign vector of the active entries. If this Jacobian has full rank, then the kernel (i.e., the tangent space) is one-dimensional and we obtain a unique vector (up to scaling and sign). As discussed in [32], we expect this to be the generic case and assume that this holds for the remainder of this paper. Otherwise, the kernel could be of higher dimension, which would lead to infinitely many possible directions. In this case, we could try to find a suitable direction in the kernel by using the directional derivative of the gradient and the fact that the absolute values of the entries of the gradient in the active indices must coincide (cf. 2(a) of Theorem 3.1).
If the reduced Hessian $(\nabla^2 f(x))_{|A(x_0)}$ is regular, then a kernel vector $\bar{v} = (\bar{v}_1, \bar{v}_2, \bar{v}_3)$ of (6) is given by
$$\bar{v}_1 = \left( (\nabla^2 f(x))_{|A(x_0)} \right)^{-1} \left( \nabla f(x)_{|A(x_0)} - s \right), \quad \bar{v}_2 = -\alpha_1, \quad \bar{v}_3 = \alpha_1. \qquad (\text{P})$$
The predictor of (5) is given by $\bar{v}_1$, while $\bar{v}_2$ and $\bar{v}_3$ represent the corresponding direction in the space of KKT multipliers. The predictor of (MOP-$\ell_1$) is then $v_1 = h_1(\bar{v}_1)$.
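The kernel computation can be sketched numerically as follows, assuming the reduced KKT map $H(y, \alpha) = (\alpha_1 \nabla f(y)_{|A} + \alpha_2 s, \; \alpha_1 + \alpha_2 - 1)$ on the active set and extracting the (generically one-dimensional) null space of its Jacobian via an SVD; the toy function and all names are illustrative:

```python
import numpy as np

def predictor_tangent(hess_A, grad_A, s, alpha1):
    """Null-space vector of the Jacobian of the reduced KKT map
    H(y, a) = (a1 * grad_f(y) + a2 * s, a1 + a2 - 1), assumed here to be
    J = [[a1 * Hess, grad, s], [0, 1, 1]]."""
    n = len(grad_A)
    top = np.hstack([alpha1 * hess_A, grad_A[:, None], s[:, None]])
    bottom = np.hstack([np.zeros(n), [1.0, 1.0]])
    J = np.vstack([top, bottom])
    _, _, Vt = np.linalg.svd(J)
    v = Vt[-1]                       # right singular vector of the zero singular value
    return v[:n], v[n], v[n + 1]     # (x-direction, d(alpha_1), d(alpha_2))

# One active index, f(y) = 0.5 * (y - 3)**2 at y = 1: grad = -2, s = 1.
# Criticality requires alpha1 * (-2) + alpha2 * 1 = 0, so alpha1 = 1/3.
v1, da1, da2 = predictor_tangent(np.array([[1.0]]), np.array([-2.0]),
                                 np.array([1.0]), 1.0 / 3.0)
print(v1, da1, da2)
```

By hand, the kernel of $[[1/3, -2, 1], [0, 1, 1]]$ is spanned by $(9, 1, -1)$, so the returned vector is proportional to that (the SVD fixes the scale and an arbitrary overall sign).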

Direction
If the reduced Hessian is regular, there are only two possible choices for the direction of the predictor step, namely $v_1$ and $-v_1$. As we want to move along the Pareto front in the direction where $f$ becomes smaller (and the $\ell_1$-norm larger), it would be intuitive to choose the sign of $\nabla f(x)^\top v_1$ as an indicator, as proposed in [18] and [19]. While this mostly works, the need to cross a "turning point" may arise. These are points where we follow the Pareto critical set in the same direction but the corresponding direction in the objective space vanishes and then turns around. This phenomenon is not related to the non-smoothness of the $\ell_1$-norm but can also occur in the smooth parts, i.e., on a single activation structure (the set of points $x \in P_c$ with a constant active set $A(x)$), as shown in the following example. Figure 2 shows the Pareto critical set for this problem, which can be computed by hand in this case. We see that we have two turning points $x^1$ and $x^2$ in the Pareto critical set, where the direction in the objective space turns around. We thus use a criterion which stems from classical continuation for dynamical systems in [32]. It is based on the sign of the determinant of the augmented Jacobian
$$J_a = \begin{pmatrix} H'(h_2(x_0), \alpha) \\ \bar{v}^\top \end{pmatrix}. \qquad (9)$$
If the sign of $\det(J_a)$ is constant on one activation structure, then this ensures that $\bar{v}$ has the same orientation with respect to the image of $H'(h_2(x_0), \alpha)$. Furthermore, it prevents us from walking back in the direction we came from (cf. [32]). As we will see in Section 3.4, the correct direction is known at the beginning of a new activation structure. Therefore, this is a suitable criterion to choose the direction in each predictor step. Furthermore, a simple calculation shows that only an auxiliary vector $z$ has to be computed in order to determine the sign of $\det(J_a)$.

Step size

Now it remains to choose a step size $h \in \mathbb{R}_{>0}$ that determines how far we move along the tangent direction, i.e., that determines the predicted point $x_p = x_0 + h \cdot (\pm v_1)$. To obtain an even coverage of the Pareto critical set, we choose a constant step size in the parameter space, i.e.,
$$h = \frac{\tau}{\|v_1\|_2} \qquad (10)$$
with a constant $\tau \in \mathbb{R}_{>0}$. For an even distribution in the objective space, see [18], [19].
Since we still assume that we are in a part of the Pareto critical set where the activation structure and the signs of the variables do not change, the issue may occur that $h$ is too large if we choose it according to one of the above formulas. Therefore, we choose $\hat{h} = \min(h, \bar{h})$, where $\bar{h}$ determines the maximal length of the step along $v_1$ such that the activation structure and signs do not change, i.e., such that no active index becomes inactive or crosses zero.
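The capping of the step size can be sketched as follows; the helper is hypothetical and implements exactly the zero-crossing rule described above:

```python
import numpy as np

def capped_step(x, v, h, active):
    """Cap the predictor step so that no active entry crosses zero:
    for every active j whose entry x_j + h * v_j would change sign,
    shrink the step to -x_j / v_j (the step length that lands on zero)."""
    h_bar = h
    for j in np.flatnonzero(active):
        if v[j] != 0.0 and np.sign(x[j] + h * v[j]) != np.sign(x[j]):
            h_bar = min(h_bar, -x[j] / v[j])
    return h_bar

x = np.array([1.0, 0.0, 2.0])
v = np.array([-0.5, 0.0, 1.0])
active = np.abs(x) > 1e-12
print(capped_step(x, v, 4.0, active))   # entry 0 would flip sign, so h is capped at 2.0
```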

Corrector
The corrector step maps the predicted point onto the Pareto critical set by finding the nearest point $x_c \in P_c$ with the same zero structure and signs as the predicted point $x_p$, i.e., with $A(x_c) = A(x_p) = A(x_0)$. The standard approach is to solve the system of equations given by the KKT conditions (4) with $x_p$ as starting point. Since the subdifferential of the $\ell_1$-norm has $M(x_0) = 2^{n_0(x_0)}$ extreme points, this can result in very large systems. Fortunately, we can exploit Theorem 3.1 to obtain a simpler way and solve the following optimization problem:
$$\min_{x \in \mathbb{R}^n} \|x - x_p\|_2^2 \quad \text{s.t.} \quad (\nabla f(x))_j = -\mathrm{sgn}(x^p_j)\,|(\nabla f(x))_a| \;\; \forall j \in A(x_0), \quad x_j = 0 \;\; \forall j \notin A(x_0), \qquad (\text{C})$$
where $a \in A(x_0)$ is some index that is currently active. In practice, we use an SQP method [33] for the solution of (C) with $x_p$ as the initial point. In case of convergence issues, one can choose a starting point closer to the Pareto critical point $x_0$, i.e., $x_p + \lambda(x_0 - x_p)$ for some $\lambda \in (0, 1)$.
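A minimal corrector sketch, enforcing the characterization of Theorem 3.1 directly (the toy quadratic objective, the concrete constraint, and all names are assumptions of this example, not the paper's formulation of (C)):

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective f(y) = 0.5 * ||y - c||^2 with both indices active, so the
# Pareto critical points are those where |grad_1| = |grad_2| (condition 2(a)).
c = np.array([3.0, 2.0])
grad = lambda y: y - c

def corrector(y_p):
    """Project the predicted point onto the set where all active gradient
    entries share the same absolute value, via an SQP-type solver."""
    cons = {"type": "eq",
            "fun": lambda y: abs(grad(y)[0]) - abs(grad(y)[1])}
    res = minimize(lambda y: np.sum((y - y_p) ** 2), y_p,
                   constraints=[cons], method="SLSQP")
    return res.x

y_c = corrector(np.array([2.5, 1.0]))
print(y_c)
```

For this quadratic, the critical set near the starting point is the line $y_1 - y_2 = 1$, and the corrector returns the Euclidean projection $(2.25, 1.25)$ of the predicted point onto it.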

Changing the activation structure
So far, we have seen how the predictor and corrector can be computed when the zero structure is locally constant around $x_0$. As discussed at the beginning of this section, this allows us to compute the smooth parts of the Pareto critical set in between the kinks. To compute the entire Pareto critical set (or a connected component of it), a mechanism that is able to change the activation structure $A(x)$ is required. According to Theorem 3.1, an index $j$ can only change from inactive to active (or vice versa) if $x_j = 0$ and $|\nabla f(x)_j| = |\nabla f(x)_a|$ with $a \in A(x)$. We will call such an index $j$ potentially active and denote the set of all potentially active indices by
$$A_p(x) := \left\{ j \notin A(x) \,\middle|\, |\nabla f(x)_j| = |\nabla f(x)_a| \text{ for } a \in A(x) \right\}. \qquad (11)$$

Activate indices

Let $x^*$ be a Pareto critical point with $A_p(x^*) \neq \emptyset$. If $x^*$ is an end point of the current activation structure, then the construction of the corrector ensures that we reach it after a finite number of steps (if the step length $\tau$ is sufficiently small). If $x^*$ is not an end point, then we will generally not be able to detect it during continuation, as we move with a constant step size. Fortunately, the latter points are less likely to exist, as will be discussed briefly at the end of this section.
Note that not all potentially active indices necessarily need to be activated, such that a way to identify the relevant indices is required. Denoting the number of potentially active indices by $p$, we have $2^p$ possible activation structures. For every possible activation structure $A^l_p \subset A_p(x^*)$, $l = 1, \ldots, 2^p$, we can compute a vector $\bar{v}^l = (\bar{v}^l_1, \bar{v}^l_2, \bar{v}^l_3)$ as in the predictor step (Section 3.2). If we assume $|\nabla f(x^*)_a| > 0$, then according to 2(a) in Theorem 3.1, the correct sign of the possibly active indices is given by $-\mathrm{sgn}(\nabla f(x^*))_j$. Therefore, $\bar{v}^l$ is given by the analogue of (P) on the extended active set $A(x^*) \cup A^l_p$, i.e.,
$$\bar{v}^l_1 = \left( (\nabla^2 f(x^*))_{|A(x^*) \cup A^l_p} \right)^{-1} \left( \nabla f(x^*)_{|A(x^*) \cup A^l_p} - s^l \right), \qquad (13)$$
with $s^l = -\mathrm{sgn}(\nabla f(x^*))_{|A(x^*) \cup A^l_p}$, provided that the Hessian restricted to $A(x^*) \cup A^l_p$ is regular. Analogously to Section 3.2, we obtain the direction $v^l_1 \in \mathbb{R}^n$ by the corresponding embedding $h_1 : \mathbb{R}^{n_A(x^*) + |A^l_p|} \to \mathbb{R}^n$. If $A^l_p$ is a set of potentially active indices that can actually be activated, i.e., if there is a curve $\gamma$ of Pareto critical points starting in $x^*$ with the active set $A(x^*) \cup A^l_p$, then by construction, $v^l_1$ is the tangent vector of that curve. By analyzing the relationship between the derivative of $\gamma$ at $x^*$ and the conditions in Theorem 3.1, we obtain necessary conditions for $A^l_p$ to be activated. This is done in the following lemma and the subsequent corollary.
Unfortunately, without knowledge of the Pareto critical set, the existence of the curve γ in the previous lemma is almost impossible to verify in practice. Recall that Corollary 3.3 indicates that P c is piecewise smooth, where parts in between kinks are embeddings of Pareto critical sets of the smooth problems (5) into R n . Thus, the existence of γ |(0,1) follows from the smooth structure of (5). In light of this, the existence of γ implies a regularity of the smooth parts when they reach a kink. For example, roughly speaking, continuous differentiability of γ on an open neighborhood of t = 0 implies that the "curvature" of P c is bounded.
Note that the statement of the previous lemma only involves the derivative w = γ (0) and the new active set A γ . By the construction in (13), w is (up to scaling) equal to the direction v l 1 for the index set A l p = A γ \ A(x * ). Although A γ is not known in practice, we can use this relation to identify the indices that need to be activated, as shown in the following corollary. Corollary 3.6. Let x * ∈ P c \ {0} with ∇f (x * ) = 0, let γ be a curve as in Lemma 3.5 and let a ∈ A(x * ).

Remark 3.7.
According to (a) in Corollary 3.6, for all possible activation structures $A^l_p(x^*) \neq \emptyset$, there either exists only one valid direction ($+v^l_1$ or $-v^l_1$) or the tangent vector $w$ of the corresponding curve $\gamma$ has to satisfy $w_j = 0$ for all $j \in A_p(x^*)$, i.e., $\gamma$ has to be tangential to a vector with activation structure $A(x^*)$, which is unlikely.
Using the conditions from Corollary 3.6, we now obtain a predictor for the suitable activation structures with the correct sign. Since we can exclude the direction we are coming from, we have to follow each of the remaining directions. This is shown in Examples 3.8 and 3.9. After checking condition (a), several candidates remain as suitable directions. By checking condition (b), the set of admissible directions can be reduced to $-v^0_1$ and $v^3_1$. Assuming that we started in the origin, we have to walk in direction $v^3_1$, i.e., both $j = 2$ and $j = 3$ become active (and positive).
Figure 3b shows the relevant part of $P_c$ for this problem. Consider $x^* = (1, 0, 0)^\top$. Since $\nabla f(x^*) = (-2, -2, 2)^\top$, we have $A_p(x^*) = \{2, 3\}$. According to (13), we obtain the possible directions. Verification of conditions (a) and (b) in Corollary 3.6 reduces the set of admissible directions to $-v^0_1$, $v^1_1$ and $v^2_1$. Assuming that we started in the origin, we have to (try to) advance with the continuation into the directions $v^1_1$ and $-v^2_1$. A predictor step along $v^1_1$ directly terminates since the corrector delivers $x^*$ as the nearest Pareto critical point, and therefore the remaining direction is $-v^2_1$. Thus, only $j = 3$ is activated. In practice, often only two possible directions remain after applying Corollary 3.6: the old and the new direction. Nevertheless, it can occur that the Pareto set splits, i.e., that there are at least three suitable directions. In this case, the continuation method needs to be advanced in each direction individually. An example where this is demonstrated can be found in Section A of the supplementary material. If no direction satisfies the conditions in Corollary 3.6, we have reached the end of a connected component of $P_c$. In the smooth parts of the Pareto critical set with constant activation structure, a connected component can only end if $\nabla f(x) = 0$, i.e., when a local minimizer of the main objective has been found [34].

Remark 3.10.
Although the requirements of Corollary 3.6 do not hold if $\nabla f(x) = 0$, similar conditions can be obtained for this case. First, note that $\nabla f(x) = 0$ implies that $A_p(x^*)$ is the set of all inactive indices. Since we cannot predict the sign of $x_j$ with $j \in A_p(x^*)$ via the sign of the gradient, we have to consider every possible combination of the signs of potentially active indices for every possible activation structure $A^l_p$. The conditions of Corollary 3.6 can then be used as necessary conditions, where the sign of the gradient is replaced by the opposite sign of the currently considered activation and sign structure. An additional condition can be obtained by using the fact that $\mathrm{sgn}(\nabla f(x)_j) = -\mathrm{sgn}(x_j)$ has to hold for all $j \in A(x)$. If the number of inactive indices is large, this results in a very large number of directions that have to be checked. Therefore, a more efficient approach needs to be developed in future work for very high-dimensional problems.
Finally, we note that it is in theory possible that indices are activated in points besides the end points of activation structures. In other words, the Pareto critical set may split into two branches, with one branch continuing along the old structure. Since we move through smooth parts of the Pareto critical set with discrete steps, points like this cannot be detected by our method. But based on Corollary 3.6 and Remark 3.7, one can show that such points are unlikely to exist in practice.

Deactivate indices
A situation where indices have to be deactivated is when the predictor crosses zero in some entry, i.e., $\mathrm{sgn}(x^p_j) \neq \mathrm{sgn}(x^0_j)$ for some $j \in A(x_0)$. In this situation, we reduce the predictor step length $h$ to $\bar{h}$, the largest step size for which no active entry crosses zero, and the indices in $A(x_0) \setminus A(x_p)$ are considered as inactive in the subsequent corrector step. If the corresponding corrector problem (C) has a solution, then the continuation method can continue on the new activation structure. Otherwise, we choose a step length in $(0, \bar{h})$ to find a new Pareto critical point with respect to the old active set $A(x_0)$. The likelihood of an active index being falsely deactivated by this mechanism depends on the constant step size $\tau$ and the curvature of the Pareto critical set.
A second scenario where active indices can be deactivated is the case where the corrector computes a point x c where one of the previously active indices is now zero, i.e., x c i = 0 for some i ∈ A(x 0 ). This implies that some of the previously active indices A(x 0 ) are now only potentially active. In this case, we proceed as in the activation procedure described in Section 3.4.1.

Implementation & Globalization
We now have all the ingredients to implement Algorithm 1.

Activation of indices:
1) If $A_p(x)$ (Eq. (11)) is non-empty, calculate $\bar{v}^l$ for all $A^l_p \subset A_p(x)$ using (13),
2) Reduce the number of directions using Corollary 3.6,
3) If more than one potential direction remains, proceed in each direction. In most cases, the corrector will yield the information that $P_c$ only continues in one direction.

Predictor step (when $A_p(x) = \emptyset$):
1) Compute the direction $v_1 = h_1(\bar{v}_1)$ using (P),
2) Determine the sign of $v_1$ using (9),
3) Compute the step size along $\pm v_1$ using (10),
4) Determine $x_p = x_0 + h \cdot (\pm v_1)$.

Deactivation of indices: Set those indices $j$ to zero for which $\mathrm{sgn}(x^p_j)$ changes, and reduce the predictor step length $h$ to $\bar{h}$ to avoid sign changes.

Corrector step: Compute the point $x_c \in P_c$ using (C) with $x_p$ as the initial condition.
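The steps above can be combined into a compact toy loop. This is only a sketch for a quadratic objective with identity Hessian, where the corrector is trivial and the distances to the next kink and to the minimizer are available in closed form; it is not the general method:

```python
import numpy as np

# Toy problem f(x) = 0.5 * ||x - c||^2 with c = (3, 1): starting at x = 0,
# activate the index with the largest |gradient|, then follow the Pareto
# critical set, activating the second index at the kink (2, 0).
c = np.array([3.0, 1.0])
grad = lambda x: x - c

tau = 0.25
x = np.zeros(2)
active = np.zeros(2, dtype=bool)
active[np.argmax(np.abs(grad(x)))] = True      # first activation at x = 0
path = [x.copy()]
while True:
    g = np.abs(grad(x))
    cmax = g[active].max()
    if cmax < 1e-10:                           # reached a minimizer of f
        break
    n_a = active.sum()
    # tangent-arclength distance to the next kink (or to the minimizer if
    # no inactive index remains); closed form thanks to the identity Hessian
    slack = cmax - (g[~active].max() if (~active).any() else 0.0)
    if slack <= 0:                             # kink: activate the new index
        active |= g >= cmax - 1e-12
        continue
    h = min(tau, np.sqrt(n_a) * (slack if (~active).any() else cmax))
    v = np.zeros(2)
    v[active] = -np.sign(grad(x)[active]) / np.sqrt(n_a)   # unit predictor
    x = x + h * v                              # corrector is exact here
    path.append(x.copy())

print(path[-1])
```

The loop first walks along the first coordinate axis, activates the second index at the kink $(2, 0)$, and then follows the diagonal until it reaches the minimizer $c = (3, 1)$, where the connected component ends ($\nabla f = 0$).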
For the starting point, an intuitive choice is $x_{\mathrm{start}} = 0$, which is Pareto critical and even Pareto optimal as the global minimal point of the $\ell_1$-norm. According to Theorem 3.1, the first index that has to be activated in $x_{\mathrm{start}} = 0$ is the one corresponding to the largest absolute value of the gradient. If there is more than one maximal entry, we can apply the strategy described in Section 3.4. Of course, every other Pareto critical point would also work as a starting point, from which we would have to walk in both directions of the predictor.
By construction, our method only computes the connected component of the Pareto critical set that contains the starting point x start . Thus, if there are multiple connected components, then additional starting points are required to restart our method once the first component is computed.
In practice, this can be done by considering the objective space: due to the definition of Pareto optimality, the new starting point should have a smaller $f$ value and a larger $\ell_1$-norm, or vice versa (depending on whether we reached the lower end or the upper end of the image of the component). This defines an area in the objective space in which the image of the new point should lie. To actually find it, the $\varepsilon$-constraint method [15] can be used, which is a solution method that can be restricted to certain parts of the objective space. Alternatively, reference point methods [15] can be used, where the solution can be influenced by choosing an appropriate reference point in the objective space.
Finally, it is important to note that our method computes points that are Pareto critical, but not necessarily Pareto optimal. To obtain the actual Pareto set, we can apply a nondominance test [35]. Since the output of our method is a fine pointwise approximation of the Pareto critical set (which contains the Pareto set), this will generally result in a fine approximation of the Pareto set.

NUMERICAL RESULTS
We now study the behavior of the continuation method using three examples. We begin with two nonlinear toy examples, and then address the identification of governing equations from data via the SINDy algorithm [1]. Note that the latter is a regression problem and can consequently be addressed by existing homotopy approaches as well as the weighted-sum method. Finally, the training of a neural network is studied. While the cost of computing the entire Pareto set via continuation is certainly higher in comparison to calculating only a single solution, we note that (a) the individual corrector problems can be solved rather quickly due to good initial guesses and (b) the randomness (and thus the necessity to perform multiple runs) can be reduced.

Toy-Examples
As a first example and to highlight the important aspects of nonlinear continuation, we consider (MOP-1) with $n = 3$. Figure 4 shows the result of our method. We observe that indices are activated and deactivated multiple times. In Figure 4(c) and (d), we see that only the convex part of $P_c$ can be computed with the weighted sum method. In particular, a large part where only $j = 3$ is active is missing.
To demonstrate the feasibility in high dimensions, we consider a problem where $a_1 = 1$ and $a_i$, $i = 2, \dots, n$, are random values in $[-1, 1]$ (similar to [19], Problem 4). Figure 5 shows the result of our method for $n = 1000$ with step size $\tau = 0.25$, which coincides with the analytical solution.

SINDy
The second example addresses the sparse identification of nonlinear dynamical systems (SINDy) [1], where the governing system of ordinary (or partial) differential equations $\dot{x}(t) = g(x(t))$ is sought. Here, $x(t) \in \mathbb{R}^m$ is the $m$-dimensional system state at time $t$, whereas the function $g$ describing the system dynamics is unknown. The SINDy approach is then to represent $g$ via a dictionary of $c$ Ansatz functions (denoted by $\Theta(x)$) with the corresponding coefficients $W \in \mathbb{R}^{c \times m}$: $g(x(t)) = W^\top \Theta(x(t))$.
The entries of $\Theta$ are arbitrary linear and nonlinear functions such as sine and cosine or polynomials. Using $N$ (potentially noisy) training data points $((x_1, \dot{x}_1), \dots, (x_N, \dot{x}_N))$, the objective is to identify a sparse representation of $g$ in terms of $\Theta$, and thus to obtain a robust and interpretable dynamical system from data. The corresponding main objective is thus the least squares term $f(W) = \frac{1}{2N} \sum_{i=1}^{N} \|W^\top \Theta(x_i) - \dot{x}_i\|_2^2$. As this is a convex function, the corresponding MOP is also convex and can thus be addressed using the standard penalization approach, i.e., the weighted sum method $\min_W f(W) + \lambda \|W\|_1$ (where $W$ is transformed into one large weight vector). However, as we will see, the solution may be highly sensitive to the choice of $\lambda$.
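Since the least squares objective is convex, the penalized problem $\min_W f(W) + \lambda \|W\|_1$ can be solved, for instance, with proximal gradient descent (ISTA) via soft thresholding. Below is a minimal sketch under that assumption (function names are ours; this is the standard penalization baseline, not the homotopy or continuation method of the paper):

```python
import numpy as np

def soft_threshold(w, t):
    # proximal operator of t * ||.||_1
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista_lasso(Theta, Xdot, lam, iters=2000):
    # minimize ||Theta @ W - Xdot||_F^2 / (2N) + lam * ||W||_1
    # Theta: (N, c) dictionary evaluations, Xdot: (N, m) derivatives
    N = Theta.shape[0]
    L = np.linalg.norm(Theta, 2) ** 2 / N        # Lipschitz constant of the gradient
    W = np.zeros((Theta.shape[1], Xdot.shape[1]))
    for _ in range(iters):
        grad = Theta.T @ (Theta @ W - Xdot) / N  # gradient of the smooth part
        W = soft_threshold(W - grad / L, lam / L)
    return W

# synthetic sanity check: recover a sparse coefficient vector
rng = np.random.default_rng(0)
Theta = rng.standard_normal((200, 5))
W_true = np.array([[1.0], [0.0], [0.0], [2.0], [0.0]])
Xdot = Theta @ W_true
W = ista_lasso(Theta, Xdot, lam=0.01)            # support of W_true is recovered
```

Each run of `ista_lasso` yields a single point on the trade-off curve for one $\lambda$; the sensitivity discussed in the text stems from the mapping from $\lambda$ to that point.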
We consider the well-known Lorenz system with parameters $\sigma = 10$, $\rho = 28$ and $\beta = 8/3$, for which the dynamical system possesses a chaotic solution: $\dot{x}_1 = \sigma (x_2 - x_1)$, $\dot{x}_2 = x_1 (\rho - x_3) - x_2$, $\dot{x}_3 = x_1 x_2 - \beta x_3$. We collect $N = 10{,}000$ training data points from a single long-term simulation, to which we add white noise. The dictionary $\Theta$ consists of polynomial terms up to order 3 and is thus comprised of $c = 20$ terms, i.e., $W \in \mathbb{R}^{20 \times 3}$. The Pareto front of the corresponding MOP, obtained via continuation, is shown in Figure 6(a); the corresponding non-zero weights and three trajectories are visualized in (b) and (c), respectively. We see that although the Pareto front is convex, choosing the weight in a penalization approach can be challenging. Large parts of the Pareto front are almost parallel to either the horizontal or vertical axis, such that a small change in the penalty weight will result in large jumps along the front (cf. also Figure 1). Note that the so-called knee point is not necessarily the desired compromise, as the system corresponding to the turquoise point is significantly less accurate on unseen data than the solution corresponding to the yellow point.
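Training data for such an experiment can be generated by simulating the Lorenz system. A sketch of the (here noise-free) data generation, using SciPy; the variable names and tolerances are our own choices, not taken from the paper:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # right-hand side of the Lorenz system with the paper's parameters
    return [sigma * (x[1] - x[0]),
            x[0] * (rho - x[2]) - x[1],
            x[0] * x[1] - beta * x[2]]

t_eval = np.linspace(0, 10, 1000)
sol = solve_ivp(lorenz, (0, 10), [1.0, 1.0, 1.0], t_eval=t_eval, rtol=1e-8)
X = sol.y.T                                   # (1000, 3) state snapshots
Xdot = np.array([lorenz(0, x) for x in X])    # exact derivatives at the snapshots
```

In the noisy setting of the paper, white noise would be added to `X` (and the derivatives estimated numerically), which is precisely what makes the sparsity-robustness trade-off along the Pareto front interesting.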

Neural Network
Our next example is the training process of the weights $W$ of a feed-forward neural network (NN), i.e., $y = g_W(x)$ [13], where $K$ is the number of hidden layers and $\sigma$ and $h$ are activation functions. The set of weight matrices, i.e., $W = \{(w_1^0, w_1), \dots, (w_{K-1}^0, w_{K-1})\}$ (which can again be transformed into a single weight vector), is chosen such that the error on a given training data set $((x_1, y_1), \dots, (x_N, y_N))$ is minimized. In the case of a classification problem, the error is expressed as the mean of the cross-entropy loss, which results in the objective function $f(W) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_c^i \log\big((g_W(x_i))_c\big)$, where $M$ is the number of classes in the data set and $y_c^i \in \{0, 1\}$ indicates whether $x_i$ belongs to class $c$. In addition, sparsity is often desired to increase robustness against noise.
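The forward pass and the mean cross-entropy loss described above can be sketched as follows (a NumPy toy with $h = \tanh$ hidden layers and a softmax output, as in the experiments below; shapes and helper names are our own):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, weights):
    # weights: list of (bias, matrix) pairs, mirroring the (w^0, w) notation
    a = x
    for b, w in weights[:-1]:
        a = np.tanh(a @ w + b)                 # hidden layers, h = tanh
    b, w = weights[-1]
    return softmax(a @ w + b)                  # output layer, sigma = softmax

def cross_entropy(probs, labels):
    # labels: one-hot matrix of shape (N, M); mean cross-entropy over the data
    return -np.mean(np.sum(labels * np.log(probs + 1e-12), axis=1))

# tiny network: 4 inputs, two hidden layers of 2 neurons, 3 classes
rng = np.random.default_rng(1)
weights = [(rng.standard_normal(2), rng.standard_normal((4, 2))),
           (rng.standard_normal(2), rng.standard_normal((2, 2))),
           (rng.standard_normal(3), rng.standard_normal((2, 3)))]
x = rng.standard_normal((5, 4))
probs = forward(x, weights)                    # rows are class probabilities
```

Flattening all biases and matrices into one vector gives the weight vector whose $\ell_1$-norm serves as the second objective in the MOP.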
We here consider a feed-forward neural network with $K = 2$ hidden layers to solve the classification task of the well-known Iris data set [36]. The activation function in the hidden layers is $h(x) = \tanh(x)$, and we choose the softmax function for the output layer, i.e., $\sigma(z)_j = e^{z_j} / \sum_{k=1}^{M} e^{z_k}$. Each layer consists of two neurons such that the total number of weights is $|W| = 25$. Figure 7 shows the Pareto front obtained via the continuation method when starting in $W = 0$ and taking $\tau = 0.1$ as step size. We computed three different connected components of the Pareto critical set and applied a non-dominance test afterwards. Since the images of the connected components intersect, they do not appear as distinct connected components in the objective space. Note that large parts of the front are non-convex. To compare with stochastic gradient descent training, we also show several solutions obtained using the Adam descent method [21] with 5000 epochs, random initial conditions, and varying weights $\lambda$ for the penalty term $\|W\|_1$. When choosing $\lambda$ too large, the solution always tends towards zero (which is likely due to the non-convexity and the convergence to a local optimum). For smaller values of $\lambda$, the Adam solutions cluster in two regions of comparable training error. More critically, for $\lambda = 10^{-3}$ we obtain solutions in both clusters, depending on the initial condition. Finally, solutions with intermediate sparsity are never obtained using Adam, as these appear to be, in the dynamical systems sense, less attractive for gradient descent schemes. This demonstrates the difficulties of the standard penalization approach and the clear advantage of the continuation method, which produces high-quality trade-off solutions right away. This is further emphasized when considering the test error based on data that was not used for training. Figure 8 shows the test error for the solutions computed with the continuation method and Adam.
While both approaches are capable of computing solutions with a low training error, we observe that the sparser solutions produced by the continuation method are significantly more robust in terms of the test error.
This is a well-known problem in continuation methods [32] as well as in other training approaches for neural networks [37]. Open problems:
1) The computation of the (reduced) Hessian and its inverse is very expensive with increasing dimension $n$.
2) The necessary conditions for the activation of indices introduced in Section 3.4.1 often yield a unique direction in which to continue, but there are also cases where several directions remain and the Pareto critical set splits into multiple branches; see also the example in Section A of the supplementary material. In particular, this is the case if $f$ possesses symmetries, as is often the case for neural networks.
3) The Hessian may become singular in particular situations, for instance when symmetries occur.
4) With increasing dimension, the $\varepsilon$-constraint method, which we use to find a new component, becomes more expensive and sometimes numerically unstable.
In order to address issue 1), we can use a BFGS or SR1 update (cf. [33]) during the optimization in the corrector step to obtain an approximation of the Hessian for the predictor step. Furthermore, this way we can directly compute the inverse by a rank-1 update. Experiments have shown that the SR1 update is more suitable in our case. To handle problem 2), we can apply a greedy strategy, i.e., we proceed only in the first suitable direction and ignore the others. This is reasonable in particular for splitting caused by symmetry, as the branches of $P_c$ then correspond to the same part of the Pareto front. This way, we can also avoid most of the cases where the Hessian is singular (issue 3)). Nevertheless, it remains an open problem how to continue when the Hessian matrix is not regular. The only way to avoid problem 4) is to not switch components at all. To this end, one could compute only the relevant connected component by computing an appropriate $x_{\text{start}}$ using scalarization (e.g., weighted sum solved with Adam). Alternatively, additional knowledge (e.g., regarding the influence of bias neurons on the number of connected components in a neural network) may help in finding suitable transitions to neighboring connected components. In the following example, we apply the above suggestions to train a larger neural network.
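The SR1 update mentioned above can be sketched in its standard textbook form (cf. [33]); the skip threshold `r` is a common safeguard, not a value taken from the paper:

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    # Symmetric rank-1 (SR1) update of a Hessian approximation B:
    #   B+ = B + (y - B s)(y - B s)^T / ((y - B s)^T s),
    # where s = x_new - x_old and y = grad_new - grad_old.
    v = y - B @ s
    denom = v @ s
    if abs(denom) < r * np.linalg.norm(s) * np.linalg.norm(v):
        return B            # standard safeguard: skip near-singular updates
    return B + np.outer(v, v) / denom

# for a quadratic objective with Hessian A we have y = A s, and SR1
# recovers A exactly after updates along independent directions
A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.eye(2)
for s in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    B = sr1_update(B, s, A @ s)
```

The inverse approximation can be maintained analogously via $H_+ = H + (s - Hy)(s - Hy)^\top / ((s - Hy)^\top y)$, so no explicit matrix inversion is needed in the predictor step.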
Example 5.1 (NN for reduced MNIST). As in Section 4.3, we train a feed-forward neural network. We consider the well-known MNIST data set [38]. The data set consists of images with 28×28 pixels, which results in a large number of trainable parameters even for few neurons. Therefore, we only take images labeled with "3" or "6" and reduce the number of pixels to 6×6, similar to [39]. To classify these images, we use a neural network with $K = 2$ hidden layers and four neurons per layer, which results in $|W| = 173$ parameters to train. We choose the softplus activation function $h(x) = \log(\exp(x) + 1)$. Figure 9(a) shows the Pareto front obtained via the continuation method when starting in $W = 0$ with $\tau = 0.25$. Similar to Section 4.3, we computed three components and used a non-dominance test to obtain the results. The shown Adam solutions are obtained with 500 epochs. We observe again that they cluster in two places and that they are significantly less sparse. Furthermore, the Adam solutions suggest that there exists another part of the Pareto front with significantly less sparsity but a slightly lower error, which is missing in our solution and would require additional runs with further starting points $x_{\text{start}}$. However, as can be seen in Figure 9(b), the effect on both training and test error is almost negligible and thus, these solutions are not of particular practical relevance. Observe that the test error does not increase again, as the best fitting model is not sparse.
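The reduction from 28×28 to 6×6 pixels could, for instance, be realized by cropping the border and averaging non-overlapping blocks. This is only one plausible scheme (the exact reduction used in [39] is not specified here):

```python
import numpy as np

def downsample(img28):
    # reduce a 28x28 image to 6x6: crop a 2-pixel border to obtain 24x24,
    # then average non-overlapping 4x4 blocks (one plausible scheme;
    # the reduction in the cited reference may differ)
    img24 = img28[2:26, 2:26]
    return img24.reshape(6, 4, 6, 4).mean(axis=(1, 3))

img = np.arange(784, dtype=float).reshape(28, 28)
small = downsample(img)     # shape (6, 6)
```

With 36 input pixels, two hidden layers of four neurons, and two output classes, the parameter count is $(36+1)\cdot 4 + (4+1)\cdot 4 + (4+1)\cdot 2 + \text{output terms}$ of the same order as the $|W| = 173$ stated above.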

CONCLUSION
The continuation method we present in this paper allows researchers and practitioners with an interest in sparse optimization to systematically calculate the entire set of compromise solutions between sparsity and the main objective, thereby allowing for much deeper insight and more informed model selection than weighted-sum approaches or descent methods that provide single Pareto optima [21], [40]. The presented approach extends the well-known homotopy methods to the much more complex nonlinear problem setting, where weighting approaches fail due to non-convexity. The optimality conditions specifically derived for this problem class allow us to use continuation methods from smooth multiobjective optimization almost everywhere, and to efficiently analyze and exploit the problem structure in points where non-smoothness occurs.
To unfold the full potential of this multiobjective approach to sparse optimization, several additional questions should be addressed in future work. These include, for instance, how to deal with symmetries, in particular with regard to higher-dimensional problems. This is of particular interest for the training of neural networks with many training parameters. Furthermore, in order to further extend the range of possible applications, the consideration of additional (more general, non-smooth) objectives as well as constraints (see, for instance, [41]) is certainly of high interest. Finally, an extension to more than one smooth objective may be useful. This could be done by combining our results with the approach in [20].
Katharina Bieker received the B.Sc. and M.Sc. degree in technomathematics from Paderborn University, Germany, in 2015 and 2017, respectively. She is currently a research assistant at the Chair of Applied Mathematics at Paderborn University and is studying towards a PhD degree with a main focus on data-driven modelling and multiobjective optimization.
Bennet Gebken received the B.Sc. and M.Sc. degree in mathematics from Paderborn University, Germany, in 2014 and 2017, respectively. He is currently a research assistant at the Chair of Applied Mathematics at Paderborn University and is studying towards a PhD degree with a main focus on the structure and computation of Pareto critical sets in multiobjective optimization.
Sebastian Peitz received his B.Sc. and M.Sc. degrees in mechanical engineering from RWTH Aachen University, Germany, in 2011 and 2013, respectively, and his PhD in Mathematics from Paderborn University in 2017. He is currently assistant professor for "Data Science for Engineering" in the Department of Computer Science at Paderborn University. His research interests are multiobjective optimization, optimal control and data-driven modeling of complex systems.
The Pareto critical set $P_c$ and its image are presented in Figure 1. We see that in the point $x = (0.5, 0.0, 0.0)$, the Pareto critical set splits into three parts, visualized in different colors. In the blue part, only indices one and three are active; indices two and three are active in the turquoise part; and all variables are active in the yellow part. Obviously, this is caused by the symmetry in $x_2$ and $x_3$, i.e., by the fact that $f((x_1, x_2, x_3)) = f((x_1, -x_3, -x_2))$. Note that the image of all three parts is equal (Figure 1(b)). Then there is an open set $U \subseteq \mathbb{R}^n$ with $x^0 \in U$ such that $x \in U$ being Pareto critical for (MOP-1) is equivalent to $A(x) = A(x^0)$ and $h_2(x)$ being Pareto critical for the MOP
$$\min_{x \in \mathbb{R}^{n_{A(x^0)}}} \begin{pmatrix} f(h_1(x)) \\ \|x\|_1 \end{pmatrix}. \qquad (1)$$
Proof. By definition of $A(x^0)$, there has to be some $U_1 \subseteq \mathbb{R}^n$ with $x^0 \in U_1$ such that $\operatorname{sgn}(x_j) = \operatorname{sgn}(x_j^0) \neq 0$ for all $j \in A(x^0)$ and all $x \in U_1$. In particular,