
SECTION I

INTRODUCTION

A sparse solution is obtained by minimizing a convex function under an $L_1$-norm constraint. A typical case is linear regression, where the target function is a quadratic function. When the true solution is sparse, it is often recovered properly even when the number of observations is small. This has been extensively studied under the name of compressed sensing; see, for example, Chen, Donoho and Saunders [1]; Candes and Wakin [2]; Donoho and Tsaig [3]; Bruckstein, Donoho and Elad [4]; Candes, Romberg and Tao [5]; Candes and Tao [6]; Elad [7]; Eldar and Kutyniok [8]; Donoho [9]; and many others.

There are a number of algorithms to obtain a sparse solution efficiently in the linear regression problem, such as LARS (Efron, Hastie, Johnstone and Tibshirani [10]), LASSO (Tibshirani [11]) and their variants. They can be extended from a quadratic cost function to a general convex function (Hirose and Komaki [12]; Friedman, Hastie and Tibshirani [13]). Since a convex function endows the manifold of the parameter space with a dually flat Riemannian geometrical structure (Amari and Nagaoka [14]; Amari and Cichocki [15]), information geometry is useful for solving the convex optimization problem. Hirose and Komaki [12] gave an extended LARS algorithm applicable to this problem, where the dually flat geometrical structure plays a fundamental role. See also Yukawa and Amari [16].

The present paper intends to elucidate geometrical properties of the solution path of a convex optimization problem under the parametric constraint that the $L_1$ norm is bounded by $c$, or equivalently of the Lagrangean problem with Lagrangean multiplier $\lambda$, where $c$ or $\lambda$ changes continuously. A main result is to show that a new version of the extended LARS is a steepest descent algorithm under the Minkovskian gradient, that is, the steepest direction of a function under a Minkovskian norm, the $L_1$ norm in the present case. Since the target function is convex, the gradient method is robust in the sense that numerical calculation errors are automatically corrected. We show that our Minkovskian gradient method is applicable to the under-determined case of compressed sensing, too, while the Hirose-Komaki algorithm [12] is applicable only to the over-determined case.

The present paper is organized as follows: After the Introduction, we show a constrained optimization problem and its Lagrangean formulation together with a few typical examples in Section II. Section III is devoted to an explanation of the information geometry derived from a strictly convex function. Here, the Riemannian metric and dually coupled flat affine connections are introduced. They define two types of geodesics. We further explain a generalized Pythagorean theorem and a projection theorem. In the particular case when the convex function is quadratic, the manifold is Euclidean and the two types of geodesics are identical. But the dually coupled affine coordinates play an important role even in this case.

In Section IV, we show that the constrained optimal solution is given by the dual geodesic projection of the unconstrained optimal solution to the $L_1$ constraint set, which is convex in the primal coordinates. Section V explains that the inverse projection gives a partition of the manifold, and its monotonic property is proved. Section VI gives the equation of the solution path, from which the least equiangular property of LARS is proved. Section VII introduces a Minkovskian gradient of a function, and proves that the extended LARS is a Minkovskian gradient descent method. We give an algorithm to obtain a solution path starting from the sparsest solution 0, adding non-zero components one by one, like LARS. However, our purpose is to elucidate geometrical properties of the problem rather than to propose an efficient numerical algorithm. Hence we do not show any numerical examples. See Hirose and Komaki [12], for example, to see how the extended LARS works. It is a future problem to elaborate algorithmic details of the Minkovskian gradient method. Section VIII shows that the Minkovskian gradient method works even for the under-determined problem, where the target function is convex but not strictly convex. Section IX discusses related topics such as the adaptive LASSO, and Section X states conclusions.

SECTION II

OPTIMIZATION OF CONVEX FUNCTION UNDER $L_1$ CONSTRAINT

We study the problem of minimizing a convex function $\varphi(\boldsymbol{\theta})$, $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p) \in \mathbb{R}^p$, under a sparsity constraint. We use the $L_1$ constraint, where the $L_1$ norm of $\boldsymbol{\theta}$ is constrained within a constant $c$,
$$F(\boldsymbol{\theta}) = \sum_i |\theta_i| \le c. \tag{1}$$
Then, the problem is formulated as
$$\text{Problem } P_c: \text{Minimize } \varphi(\boldsymbol{\theta}) \text{ under } F(\boldsymbol{\theta}) \le c. \tag{2}$$
We can solve it by the Lagrange method, which is formulated as
$$\text{Problem } P_\lambda: \text{Minimize } \varphi(\boldsymbol{\theta}) + \lambda F(\boldsymbol{\theta}), \tag{3}$$
where $\lambda$ is a Lagrangian multiplier. We state three well-known examples.

A. Linear Regression Under Gaussian Noise

Given $n$ design vectors of $p$ dimensions,
$$\boldsymbol{x}_a = (x_{a1}, \ldots, x_{ap}), \quad a = 1, \ldots, n, \tag{4}$$
$n$ responses
$$y_a = \sum_{i=1}^{p} x_{ai}\theta_i + \varepsilon_a, \quad a = 1, \ldots, n, \tag{5}$$
are observed, where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)$ are parameters to be estimated and $\varepsilon_a$ are independent 0-mean Gaussian noises subject to $N(0, \sigma^2)$.

The maximum likelihood estimator is obtained by minimizing the sum of squared errors
$$\varphi(\boldsymbol{\theta}) = \frac{1}{2}\sum_a \left(y_a - \boldsymbol{\theta}\cdot\boldsymbol{x}_a\right)^2 = \frac{1}{2}\left|\boldsymbol{y} - X\boldsymbol{\theta}\right|^2, \tag{6}$$
where
$$\boldsymbol{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} \boldsymbol{x}_1 \\ \vdots \\ \boldsymbol{x}_n \end{pmatrix}, \quad \boldsymbol{\theta} = \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_p \end{pmatrix}. \tag{7}$$
This is a quadratic function rewritten as
$$\varphi(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}'G\boldsymbol{\theta} - \boldsymbol{y}'X\boldsymbol{\theta} + \frac{1}{2}\boldsymbol{y}'\boldsymbol{y}, \tag{8}$$
where
$$G = X'X, \tag{9}$$
"$'$" denoting transposition. When the number $n$ of observations is larger than the number $p$ of parameters, the problem is over-determined in general and $\varphi(\boldsymbol{\theta})$ is a strictly convex function. In this case, $G$ is a positive-definite matrix and we have a unique minimizer $\boldsymbol{\theta}^{opt}$ of $\varphi$ which satisfies
$$\nabla\varphi\left(\boldsymbol{\theta}^{opt}\right) = 0, \tag{10}$$
where $\nabla$ is the gradient operator, $\nabla = \left(\partial/\partial\theta_i\right)$. The minimizer is given by
$$\boldsymbol{\theta}^{opt} = G^{-1}X'\boldsymbol{y}. \tag{11}$$
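As a concrete illustration of (9)-(11), the following minimal Python/NumPy sketch (the variable names and the synthetic data are our own, not the paper's) computes the unconstrained minimizer $\boldsymbol{\theta}^{opt}$ in the over-determined case.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5                              # over-determined: n > p
X = rng.standard_normal((n, p))           # design matrix, rows are x_a of (4)
theta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.7])
y = X @ theta_true + 0.1 * rng.standard_normal(n)   # responses (5)

G = X.T @ X                               # eq. (9), positive definite when rank(X) = p
theta_opt = np.linalg.solve(G, X.T @ y)   # eq. (11): theta_opt = G^{-1} X' y
```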

When $n < p$, the problem is under-determined. In this case, $G$ is positive-semidefinite and rank-deficient. The minimizer of $\varphi$ is not unique; the minimizers of $\varphi$ form an affine subspace. We need to use a sparsity constraint for obtaining a unique sparse solution. We study the over-determined case first, but our Minkovskian gradient method is applicable to the under-determined case as well.

B. Non-Gaussian Noise

Consider the case where the noise $\varepsilon_a$ is not Gaussian and its probability density function is subject to
$$p(\varepsilon) = \kappa\exp\left\{-\psi(\varepsilon)\right\}, \tag{12}$$
where $\psi(\varepsilon)$ is a convex function and $\kappa > 0$ is a constant. Typically,
$$\psi(\varepsilon) = |\varepsilon|^k \tag{13}$$
and is the Laplace noise for $k = 1$. The linear regression problem is to minimize
$$\varphi(\boldsymbol{\theta}) = \sum_a \psi\left(y_a - \boldsymbol{x}_a\cdot\boldsymbol{\theta}\right). \tag{14}$$
The target function is a convex function of $\boldsymbol{\theta}$ but is not quadratic except for the Gaussian case of $k = 2$.
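As a small illustration, a Python sketch of the convex loss (14) with the $|\varepsilon|^k$ noise model (13); the function name and the NumPy implementation are our own.

```python
import numpy as np

def lk_regression_loss(theta, X, y, k=1.0):
    """Convex loss (14) with psi(e) = |e|^k; k = 1 is the Laplace case, k = 2 the Gaussian one."""
    residuals = y - X @ theta
    return np.sum(np.abs(residuals) ** k)
```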

C. Logistic Regression

A logistic regression problem has binary responses, $y_a = 0, 1$, $a = 1, \ldots, n$, and their probabilities are given by the logistic curve,
$$\mathrm{Prob}\left\{y_a = y\right\} = \exp\left\{\xi_a y - \psi(\xi_a)\right\}, \quad y = 0, 1, \tag{15}$$
where $\psi(\xi)$ is the normalization term given by
$$\psi(\xi) = \log\left\{1 + \exp(\xi)\right\}. \tag{16}$$
The parameter $\xi_a$ for $y_a$ is given by
$$\xi_a = \boldsymbol{x}_a\cdot\boldsymbol{\theta}. \tag{17}$$
The loss function is the negative of the sum of the log probabilities (15),
$$\varphi(\boldsymbol{\theta}) = -\sum_a y_a\boldsymbol{x}_a\cdot\boldsymbol{\theta} + \sum_a \psi\left(\boldsymbol{x}_a\cdot\boldsymbol{\theta}\right), \tag{18}$$
which is a convex function (strictly convex when $n \ge p$). The optimal solution is given by the solution of
$$\sum_a\left\{y_a - \frac{\exp\left(\boldsymbol{x}_a\cdot\boldsymbol{\theta}\right)}{1 + \exp\left(\boldsymbol{x}_a\cdot\boldsymbol{\theta}\right)}\right\}\boldsymbol{x}_a = 0. \tag{19}$$
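A minimal Python sketch of the logistic loss (18) and its gradient, whose zero gives the stationarity condition (19); the function names are ours and are given only for illustration.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Convex loss (18): -sum_a y_a xi_a + sum_a log(1 + exp(xi_a))."""
    xi = X @ theta
    return -y @ xi + np.sum(np.logaddexp(0.0, xi))

def logistic_grad(theta, X, y):
    """Gradient of (18); setting it to zero reproduces (19)."""
    xi = X @ theta
    return X.T @ (1.0 / (1.0 + np.exp(-xi)) - y)
```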

SECTION III

INFORMATION GEOMETRY OF CONVEX OPTIMIZATION

A strictly convex smooth function induces a geometrical structure in a manifold $M$; see information geometry [14]. It becomes a Riemannian manifold, where a Riemannian metric tensor $G = (g_{ij})$ is defined by the Hessian of $\varphi$,
$$G(\boldsymbol{\theta}) = \nabla\nabla\varphi(\boldsymbol{\theta}). \tag{20}$$
The squared length of a small line element $d\boldsymbol{\theta}$ is given by the quadratic form
$$ds^2 = \langle d\boldsymbol{\theta}, d\boldsymbol{\theta}\rangle = d\boldsymbol{\theta}'G(\boldsymbol{\theta})\,d\boldsymbol{\theta} = \sum g_{ij}\,d\theta_i\,d\theta_j. \tag{21}$$
When there are two small line elements $d_1\boldsymbol{\theta}$ and $d_2\boldsymbol{\theta}$, their inner product at $\boldsymbol{\theta}$ is given by
$$\langle d_1\boldsymbol{\theta}, d_2\boldsymbol{\theta}\rangle = d_1\boldsymbol{\theta}'\,G(\boldsymbol{\theta})\,d_2\boldsymbol{\theta}, \tag{22}$$
and they are orthogonal when it vanishes.

A divergence function $D[P:Q]$, called the Bregman divergence, is introduced between two points $P$ and $Q$ in $M$ based on $\varphi(\boldsymbol{\theta})$. It is defined as
$$D[P:Q] = \varphi\left(\boldsymbol{\theta}_P\right) - \varphi\left(\boldsymbol{\theta}_Q\right) - \nabla\varphi\left(\boldsymbol{\theta}_Q\right)\cdot\left(\boldsymbol{\theta}_P - \boldsymbol{\theta}_Q\right), \tag{23}$$
where $\boldsymbol{\theta}_P$ and $\boldsymbol{\theta}_Q$ are the coordinates of $P$ and $Q$, respectively. See Fig. 1. The divergence is non-negative, and is equal to 0 when and only when $P = Q$. It is not a distance, because it is not symmetric in general, that is, $D[P:Q] = D[Q:P]$ does not hold in general. When $Q = P + dP$ is infinitesimally close to $P$, let $\boldsymbol{\theta}$ and $\boldsymbol{\theta} + d\boldsymbol{\theta}$ be their coordinates. Then, the Taylor expansion proves that their divergence is related to the Riemannian metric,
$$D[P : P + dP] = \frac{1}{2}\sum g_{ij}\,d\theta_i\,d\theta_j. \tag{24}$$
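A minimal numerical sketch of the Bregman divergence (23); the quadratic potential below is our own illustrative choice, for which $D[P:Q]$ reduces to half the squared Mahalanobis distance.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, theta_p, theta_q):
    """Eq. (23): D[P:Q] = phi(P) - phi(Q) - <grad phi(Q), P - Q>."""
    return phi(theta_p) - phi(theta_q) - grad_phi(theta_q) @ (theta_p - theta_q)

# Example with the quadratic potential phi(theta) = 0.5 theta' G theta
G = np.array([[2.0, 0.5], [0.5, 1.0]])
phi = lambda th: 0.5 * th @ G @ th
grad_phi = lambda th: G @ th

p, q = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman_divergence(phi, grad_phi, p, q))   # equals 0.5 (p-q)' G (p-q) here
```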

Fig. 1. Bregman divergence derived from $\varphi(\boldsymbol{\theta})$.

When an affine connection is introduced in $M$, it defines "straightness" of a curve. A straight curve is called a geodesic. Two dually coupled affine connections and related covariant derivatives are naturally introduced in $M$ by using a divergence function $D$ [14], [15]. Hence, two types of geodesics are defined by the two affine connections. However, we avoid stating the details of the differential geometry. We state only that a manifold $M$ having a convex $\varphi(\boldsymbol{\theta})$ defines two affine connections which are flat. Since they are flat, we have two special affine coordinate systems. They are affine in the respective senses, and the geodesics are given as linear curves in the respective coordinate systems.

One affine coordinate system is $\boldsymbol{\theta}$ itself, in terms of which the convex function $\varphi$ is defined. The other is its Legendre transform
$$\boldsymbol{\eta} = \nabla\varphi(\boldsymbol{\theta}), \tag{25}$$
where $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are in one-to-one correspondence. It is easy to see from (20) that the Riemannian metric is given by
$$g_{ij}(\boldsymbol{\theta}) = \frac{\partial\eta_i(\boldsymbol{\theta})}{\partial\theta_j}. \tag{26}$$

The Legendre duality guarantees the existence of another convex function of $\boldsymbol{\eta}$, defined by
$$\varphi^*(\boldsymbol{\eta}) = \max_{\boldsymbol{\theta}}\left\{\boldsymbol{\theta}\cdot\boldsymbol{\eta} - \varphi(\boldsymbol{\theta})\right\}. \tag{27}$$
The $\boldsymbol{\theta}$ coordinates are recovered by its gradient,
$$\boldsymbol{\theta} = \nabla\varphi^*(\boldsymbol{\eta}). \tag{28}$$
Another Bregman divergence $D^*$ is defined in the coordinates $\boldsymbol{\eta}$ by using the dual convex function $\varphi^*$ as
$$D^*[P:Q] = \varphi^*\left(\boldsymbol{\eta}_P\right) - \varphi^*\left(\boldsymbol{\eta}_Q\right) - \nabla\varphi^*\left(\boldsymbol{\eta}_Q\right)\cdot\left(\boldsymbol{\eta}_P - \boldsymbol{\eta}_Q\right). \tag{29}$$
However, we can prove that it satisfies
$$D^*[P:Q] = D[Q:P], \tag{30}$$
so that the two are substantially the same, except for the order of the points. The divergence is written concisely by using both the $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ coordinates,
$$D[P:Q] = \varphi\left(\boldsymbol{\theta}_P\right) + \varphi^*\left(\boldsymbol{\eta}_Q\right) - \boldsymbol{\theta}_P\cdot\boldsymbol{\eta}_Q, \tag{31}$$
where $\boldsymbol{\theta}_P$ and $\boldsymbol{\theta}_Q$ are the $\boldsymbol{\theta}$ coordinates of $P$ and $Q$, and $\boldsymbol{\eta}_P$ and $\boldsymbol{\eta}_Q$ are the $\boldsymbol{\eta}$ coordinates of $P$ and $Q$, respectively.
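A minimal numeric check of the Legendre duality, under the quadratic potential of Section II-A (our own illustrative choice): for $\varphi(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}'G\boldsymbol{\theta}$ one has $\boldsymbol{\eta} = G\boldsymbol{\theta}$ and $\varphi^*(\boldsymbol{\eta}) = \frac{1}{2}\boldsymbol{\eta}'G^{-1}\boldsymbol{\eta}$, so (30) can be verified directly.

```python
import numpy as np

G = np.array([[2.0, 0.5], [0.5, 1.0]])
Ginv = np.linalg.inv(G)

phi      = lambda th: 0.5 * th @ G @ th        # primal potential
phi_star = lambda et: 0.5 * et @ Ginv @ et     # Legendre transform (27)
to_eta   = lambda th: G @ th                   # eq. (25)

def D(theta_p, theta_q):
    """Bregman divergence written as in (31), using both coordinate systems."""
    return phi(theta_p) + phi_star(to_eta(theta_q)) - theta_p @ to_eta(theta_q)

def D_star(theta_p, theta_q):
    """Dual divergence (29), computed in the eta coordinates."""
    ep, eq = to_eta(theta_p), to_eta(theta_q)
    return phi_star(ep) - phi_star(eq) - (Ginv @ eq) @ (ep - eq)

p, q = np.array([1.0, -1.0]), np.array([0.3, 2.0])
print(np.isclose(D_star(p, q), D(q, p)))       # eq. (30)
```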

Since the Jacobian of the coordinate transformation from $\boldsymbol{\theta}$ to $\boldsymbol{\eta}$ is given by (26), the Jacobian of the reverse transformation is
$$g^*_{ij} = \frac{\partial\theta_i}{\partial\eta_j}, \tag{32}$$
which is the inverse of $G = (g_{ij})$. Hence, $G^* = G^{-1} = (g^*_{ij})$ is the Riemannian metric tensor expressed in the coordinate system $\boldsymbol{\eta}$.

Let us consider a curve $\boldsymbol{\theta}(t)$ parameterized by $t$. Its tangent vector at $t$ is given by
$$\dot{\boldsymbol{\theta}}(t) = \sum \dot{\theta}_i(t)\,\boldsymbol{e}_i, \tag{33}$$
where $\dot{\theta}_i = (d/dt)\theta_i(t)$ and $\boldsymbol{e}_i$ is the tangent vector along the coordinate axis $\theta_i$. Since $\boldsymbol{\theta}$ is an affine coordinate system, a geodesic ($\boldsymbol{\theta}$-geodesic) is a curve written as
$$\boldsymbol{\theta}(t) = t\boldsymbol{a} + \boldsymbol{b} \tag{34}$$
in the $\boldsymbol{\theta}$ coordinate system, where $\boldsymbol{a}$ and $\boldsymbol{b}$ are constant vectors.

Dually, a curve can be expressed in the $\boldsymbol{\eta}$ coordinate system as $\boldsymbol{\eta}(t)$. By using the tangent vector $\boldsymbol{e}^*_i$ of the coordinate axis $\eta_i$, the tangent vector of $\boldsymbol{\eta}(t)$ is written as
$$\dot{\boldsymbol{\eta}}(t) = \sum \dot{\eta}_i(t)\,\boldsymbol{e}^*_i. \tag{35}$$
A dual geodesic, which we call an $\boldsymbol{\eta}$-geodesic, is written as
$$\boldsymbol{\eta}(t) = t\boldsymbol{a}^* + \boldsymbol{b}^* \tag{36}$$
in the $\boldsymbol{\eta}$ coordinate system, where $\boldsymbol{a}^*$ and $\boldsymbol{b}^*$ are constant vectors. This is not a $\boldsymbol{\theta}$-geodesic in general.

The Riemannian metric $G$ is given by the inner products of the basis vectors,
$$g_{ij}(\boldsymbol{\theta}) = \langle\boldsymbol{e}_i, \boldsymbol{e}_j\rangle. \tag{37}$$
Similarly, $G^*$ is given by
$$g^*_{ij} = \langle\boldsymbol{e}^*_i, \boldsymbol{e}^*_j\rangle. \tag{38}$$
From (37), (38) and $G^* = G^{-1}$, we see that the two bases of tangent vectors are related by
$$\boldsymbol{e}_i = \sum g_{ij}\,\boldsymbol{e}^*_j, \quad \boldsymbol{e}^*_j = \sum g^*_{ji}\,\boldsymbol{e}_i. \tag{39}$$
We have an important result from this. See Fig. 2.

Fig. 2. Orthogonality of two coordinate axes.

Theorem 1

The two affine coordinate systems $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are reciprocal, that is, their tangent vectors $\boldsymbol{e}_i$ and $\boldsymbol{e}^*_j$ are orthogonal at any point when $i \ne j$,
$$\langle\boldsymbol{e}_i, \boldsymbol{e}^*_j\rangle = \delta_{ij}, \tag{40}$$
where
$$\delta_{ij} = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{41}$$

Now we state a fundamental theorem of a dually flat manifold.

2) Pythagorean Theorem

For three points $P$, $Q$ and $R$, when the $\boldsymbol{\theta}$-geodesic connecting $P$ and $Q$ is orthogonal to the $\boldsymbol{\eta}$-geodesic connecting $Q$ and $R$,
$$D[P:Q] + D[Q:R] = D[P:R]. \tag{42}$$
Dually, when the $\boldsymbol{\eta}$-geodesic connecting $P$ and $Q$ is orthogonal to the $\boldsymbol{\theta}$-geodesic connecting $Q$ and $R$,
$$D^*[P:Q] + D^*[Q:R] = D^*[P:R]. \tag{43}$$

See Fig. 3. We finally define a geodesic projection in a dually flat manifold. Let $S$ be a smooth submanifold in a dually flat manifold $M$ and let $P$ be a point outside $S$. Let us consider a point $Q \in S$. When the $\boldsymbol{\eta}$-geodesic connecting $P$ and $Q$ is orthogonal to $S$, the point $Q$ is called the $\boldsymbol{\eta}$-projection of $P$ to $S$. (We can define the $\boldsymbol{\theta}$-projection similarly.) We now have the $\boldsymbol{\eta}$-projection theorem from the Pythagorean theorem.

Fig. 3. Generalized Pythagorean theorem.

3) Projection Theorem

For a smooth submanifold $S$ in a dually flat manifold $M$ and a point $P$ outside $S$, the point in $S$ that minimizes the divergence $D^*[P:Q]$, $Q \in S$, is the $\boldsymbol{\eta}$-projection of $P$ to $S$. Moreover, when $S$ is convex in the $\boldsymbol{\theta}$ coordinates, the $\boldsymbol{\eta}$-projection exists and is unique.

In the special case when $\varphi(\boldsymbol{\theta})$ is a quadratic function, the Riemannian metric $G$ does not depend on $\boldsymbol{\theta}$, so that it is a constant tensor. The manifold is Euclidean from the point of view of the Riemannian metric. The two affine coordinate systems are linearly related in this case,
$$\boldsymbol{\eta} = G\boldsymbol{\theta} + \boldsymbol{c} \tag{44}$$
for a constant $\boldsymbol{c}$. Hence, a $\boldsymbol{\theta}$-geodesic is an $\boldsymbol{\eta}$-geodesic at the same time, so the two types of geodesics are identical; both are simply Euclidean geodesics. Each of the coordinate systems $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ is in general not orthonormal but oblique; the two systems are mutually reciprocal in the sense of (40). However, when $\varphi(\boldsymbol{\theta})$ is not a quadratic function, $G$ depends on $\boldsymbol{\theta}$ and the manifold is not Euclidean, although it has the two mutually dual affine structures from the point of view of information geometry.

SECTION IV

GEOMETRY OF $L_1$ OPTIMIZATION

LASSO obtains a solution that minimizes a quadratic function $\varphi(\boldsymbol{\theta})$ under the $L_1$ constraint. We consider here an over-determined case, since it is easier to explain the geometrical structure; however, our algorithm works in the under-determined case as well. Let $\boldsymbol{\theta}^{opt}_c$ be the optimal solution of Problem $P_c$. Since the constraint region specified by $c$,
$$R_c = \left\{\boldsymbol{\theta} \;\Big|\; \sum_i|\theta_i| \le c\right\}, \tag{45}$$
is a $\boldsymbol{\theta}$-convex set, the optimal solution is unique and is given by the projection of $\boldsymbol{\theta}^{opt}$ to the boundary $B_c$ of the polyhedron $R_c$. Note that $B_c$ is piecewise linear, including non-differentiable points such as vertices, edges, and higher-dimensional subfaces.

It is well known that, in a Euclidean space, the projection of $\boldsymbol{\theta}^{opt}$ to a smooth convex region bounded by $S(\boldsymbol{\theta}) = c$ is the point $\boldsymbol{\theta}^{opt}_S \in S$ such that the vector $\boldsymbol{\theta}^{opt} - \boldsymbol{\theta}^{opt}_S$ connecting $\boldsymbol{\theta}^{opt}$ and $\boldsymbol{\theta}^{opt}_S$ is orthogonal to $S$,
$$\boldsymbol{\theta}^{opt} - \boldsymbol{\theta}^{opt}_S \propto G^{-1}\nabla S\left(\boldsymbol{\theta}^{opt}_S\right). \tag{46}$$
Here, $G$ is a constant tensor given by $g_{ij} = \langle\boldsymbol{e}_i, \boldsymbol{e}_j\rangle$ and is not necessarily the identity matrix if the coordinate system $\boldsymbol{\theta}$ is not orthonormal but oblique. Hence, $G^{-1}\nabla S(\boldsymbol{\theta})$ denotes the normal vector of $S$. In the present case, $B_c$ is not differentiable but is piecewise differentiable. When $\boldsymbol{\theta}^{opt}_{B_c}$ belongs to a hypersurface of $B_c$, (46) is satisfied. But when it belongs to a subface (say an edge of $B_c$), the projection is defined by using the subgradient instead of the gradient. We explain this below.

The function $F(\boldsymbol{\theta})$ is not differentiable when some $\theta_i = 0$. The non-differentiable points sit on subfaces of the polyhedron, where some $\theta_i = 0$. In order to show which subface $\boldsymbol{\theta}$ belongs to, we define an active set of indices for each $\boldsymbol{\theta}$,
$$A(\boldsymbol{\theta}) = \left\{i \;\big|\; \theta_i \ne 0\right\}, \quad A \subset N = \{1, 2, \ldots, p\}. \tag{47}$$
Then, $F(\boldsymbol{\theta})$ is not differentiable at points whose active sets are not equal to $N$.

A subgradient $\nabla F(\boldsymbol{\theta})$ is used in convex analysis when $F$ is not differentiable (Bertsekas [17]; Boyd and Vandenberghe [18]). It is a normal vector of one of the supporting hyperplanes of $F$, and is written in component form as
$$\left(\nabla F\right)_i = \begin{cases} \dfrac{\partial}{\partial\theta_i}F(\boldsymbol{\theta}) = \mathrm{sgn}\,\theta_i, & i \in A, \\[4pt] \left(\nabla F\right)_i \in [-1, 1], & i \in \bar{A} = N - A, \end{cases} \tag{48}$$
in the present case of (1). Here, the $i$-th component $(\nabla F)_i$ of a subgradient $\nabla F$ is the ordinary partial derivative for $i \in A$, equal to the sign of $\theta_i$, but is any value in the interval $[-1, 1]$ for $i \in \bar{A}$, where $\theta_i = 0$. See Fig. 4. The set of all subgradients at a point $\boldsymbol{\theta}$ is denoted by $\partial F(\boldsymbol{\theta})$ and is called the subdifferential of $F$ at $\boldsymbol{\theta}$ [17].
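A minimal sketch of one element of the subdifferential (48); the `free_components` argument, which fills in the otherwise arbitrary values on $\bar{A}$, is our own device for illustration.

```python
import numpy as np

def l1_subgradient(theta, free_components=None):
    """One element of the subdifferential (48) of F(theta) = sum_i |theta_i|.

    For active indices (theta_i != 0) the component is sgn(theta_i); for
    inactive indices it may be any value in [-1, 1], supplied via
    free_components (default 0) and clipped to that interval."""
    g = np.sign(theta).astype(float)
    inactive = (theta == 0)
    if free_components is not None:
        g[inactive] = np.clip(free_components[inactive], -1.0, 1.0)
    return g
```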

Fig. 4. Gradient vector, subgradient vectors and subdifferential.

In the present case, where $\varphi$ is not necessarily quadratic, by differentiating the Lagrangean formulation (3) with respect to $\boldsymbol{\theta}$, the optimal solution $\boldsymbol{\theta}^{opt}_\lambda$ satisfies
$$\nabla\varphi\left(\boldsymbol{\theta}^{opt}_\lambda\right) = -\lambda\nabla F\left(\boldsymbol{\theta}^{opt}_\lambda\right). \tag{49}$$
However, when the active set $A\left(\boldsymbol{\theta}^{opt}_\lambda\right)$ is not $N$, $\nabla F$ is a subgradient. Since
$$\boldsymbol{\eta}^{opt} = \nabla\varphi\left(\boldsymbol{\theta}^{opt}\right) = 0, \tag{50}$$
(49) is written as
$$\boldsymbol{\eta}^{opt} - \boldsymbol{\eta}^{opt}_\lambda = \lambda\nabla F\left(\boldsymbol{\theta}^{opt}_\lambda\right) \tag{51}$$
in the $\boldsymbol{\eta}$-coordinates. This shows that $\boldsymbol{\eta}^{opt}_\lambda$ is the $\boldsymbol{\eta}$-projection of $\boldsymbol{\eta}^{opt}$ to $B_c$, where $\lambda = \lambda(c)$ is determined from $c$.

Indeed, we have from (10), (30) and (31),
$$D^*\left[\boldsymbol{\theta}^{opt} : \boldsymbol{\theta}\right] = \varphi(\boldsymbol{\theta}) + \varphi^*\left(\boldsymbol{\eta}^{opt}\right). \tag{52}$$
Hence, minimizing $\varphi(\boldsymbol{\theta})$ on $B_c$ is equivalent to minimizing $D^*\left[\boldsymbol{\theta}^{opt} : \boldsymbol{\theta}\right]$ on $B_c$. Information geometry shows that $\boldsymbol{\theta}^{opt}_\lambda$ is the $\boldsymbol{\eta}$-projection of $\boldsymbol{\theta}^{opt}$ to $B_c$.

The $\boldsymbol{\eta}$-geodesic connecting $\boldsymbol{\eta}_c$ and $\boldsymbol{\eta}^{opt}$ is
$$\boldsymbol{\eta}(t) = (1 - t)\boldsymbol{\eta}^{opt} + t\boldsymbol{\eta}_c = t\boldsymbol{\eta}_c, \tag{53}$$
and its tangent direction is
$$\dot{\boldsymbol{\eta}}(t) = \boldsymbol{\eta}_c. \tag{54}$$
Hence, from the projection theorem, we have the following optimality condition. See Fig. 5.

Fig. 5. $\boldsymbol{\eta}$-projection of $\boldsymbol{\theta}^{opt}$.

Theorem 2

The $\boldsymbol{\eta}$-projection of $\boldsymbol{\theta}^{opt}$ to $B_c$ is $\boldsymbol{\eta}^{opt}_c$ in the $\boldsymbol{\eta}$-coordinates, which satisfies
$$\boldsymbol{\eta}^{opt}_c = -\lambda\nabla F\left(\boldsymbol{\theta}^{opt}_c\right) \tag{55}$$
for one of the subgradients.

We denote by $\Pi_c$ the $\boldsymbol{\eta}$-geodesic projection operator to $B_c$, so that
$$\Pi_c\,\boldsymbol{\theta}^{opt} = \boldsymbol{\theta}^{opt}_c. \tag{56}$$

SECTION V

INVERSE PROJECTION

The optimal solution $\boldsymbol{\theta}^{opt}_c$ on $B_c$, that is, the $\boldsymbol{\eta}$-geodesic projection of $\boldsymbol{\theta}^{opt}$ to $B_c$, is unique since $R_c$ is $\boldsymbol{\theta}$-convex. Consider the inverse projection. Given a point $\boldsymbol{\theta}$ on $B_c$, we define the set of points $\boldsymbol{\theta}'$ outside $B_c$ such that the $\boldsymbol{\eta}$-geodesic projection of $\boldsymbol{\theta}'$ to $B_c$ is equal to $\boldsymbol{\theta}$,
$$\Pi^{-1}_c\boldsymbol{\theta} = \left\{\boldsymbol{\theta}' \;\big|\; \Pi_c\boldsymbol{\theta}' = \boldsymbol{\theta}\right\}. \tag{57}$$
This is called the inverse $\boldsymbol{\eta}$-projection of $\boldsymbol{\theta}$. The inverse $\boldsymbol{\eta}$-projection of $\boldsymbol{\theta} \in B_c$ is written simply by using the dual coordinates,
$$\Pi^{-1}_c\boldsymbol{\eta} = \left\{\boldsymbol{\eta}' \;\big|\; \boldsymbol{\eta}' - \boldsymbol{\eta} = t\nabla F(\boldsymbol{\theta}),\; t \ge 0\right\}. \tag{58}$$
When $\boldsymbol{\theta}$ belongs to a face of $B_c$, that is, when $A(\boldsymbol{\theta}) = N$, $\nabla F(\boldsymbol{\theta})$ is unique, so that $\Pi^{-1}_c\boldsymbol{\theta}$ is the $\boldsymbol{\eta}$-geodesic in the direction of $\nabla F$ passing through $\boldsymbol{\theta}$. In the $\boldsymbol{\eta}$-coordinates, we have
$$\boldsymbol{\eta}' = \boldsymbol{\eta} + t\nabla F(\boldsymbol{\theta}). \tag{59}$$
When $\boldsymbol{\theta}$ belongs to a subface whose active set $A(\boldsymbol{\theta})$ is not $N$, $\nabla F(\boldsymbol{\theta})$ is not unique. In this case, the inverse projection is written as
$$\Pi^{-1}_c\boldsymbol{\eta} = \left\{\boldsymbol{\eta}' = \boldsymbol{\eta} + t\boldsymbol{n} \;\big|\; \boldsymbol{n} \in \partial F(\boldsymbol{\theta}),\; t \ge 0\right\}. \tag{60}$$
See Fig. 6.

Fig. 6. Inverse geodesic projection.

For $c' < c$, let us consider how $\Pi^{-1}_c$ and $\Pi^{-1}_{c'}$ are related. When $\boldsymbol{\theta}_{c'}$ is in a face of $B_{c'}$, i.e., $A\left(\boldsymbol{\theta}_{c'}\right) = N$, if the $\boldsymbol{\eta}$-geodesic orthogonal to $B_{c'}$ passes through $\boldsymbol{\theta}_c \in B_c$, the geodesic $\Pi^{-1}_c\boldsymbol{\theta}_c$ is a part of $\Pi^{-1}_{c'}\boldsymbol{\theta}_{c'}$ (Fig. 7). Hence,
$$\Pi^{-1}_c\boldsymbol{\theta}_c \subset \Pi^{-1}_{c'}\boldsymbol{\theta}_{c'}. \tag{61}$$
When $\boldsymbol{\theta}_{c'}$ lies on a subface specified by $A\;(\ne N)$, its inverse image forms an $\boldsymbol{\eta}$-flat cone defined in the $\boldsymbol{\eta}$-coordinates by
$$\Pi^{-1}_{c'}\boldsymbol{\theta}_{c'} = \left\{\boldsymbol{\eta} \;\big|\; \boldsymbol{\eta} - \boldsymbol{\eta}_{c'} = t\nabla F\left(\boldsymbol{\theta}_{c'}\right),\; t \ge 0\right\}. \tag{62}$$
Let $\boldsymbol{\theta}_c$ be a point in $B_c$ included in $\Pi^{-1}_{c'}\boldsymbol{\theta}_{c'}$. When $A\left(\boldsymbol{\theta}_c\right) = A\left(\boldsymbol{\theta}_{c'}\right)$, the subdifferentials are identical, $\partial F\left(\boldsymbol{\theta}_c\right) = \partial F\left(\boldsymbol{\theta}_{c'}\right)$. Then, $\Pi^{-1}_{c'}\boldsymbol{\theta}_{c'}$ is an $\boldsymbol{\eta}$-parallel transport of $\Pi^{-1}_c\boldsymbol{\theta}_c$ (Fig. 7). Hence, we have the following inclusion theorem.

Fig. 7. Inclusion theorem of inverse projection.

Theorem 3 (Inclusion Theorem)

Let $\boldsymbol{\theta}_c$ and $\boldsymbol{\theta}_{c'}$ $(c' < c)$ be two points satisfying $\boldsymbol{\theta}_c \in \Pi^{-1}_{c'}\boldsymbol{\theta}_{c'}$. Then
$$\Pi^{-1}_c\boldsymbol{\theta}_c \subset \Pi^{-1}_{c'}\boldsymbol{\theta}_{c'}, \quad c' < c. \tag{63}$$

The theorem implies that, as $c$ decreases, the region of the cone $\Pi^{-1}_c\boldsymbol{\theta}_c$ increases monotonically. When $A\left(\boldsymbol{\theta}_{c'}\right)$ is not full, the inverse image $\Pi^{-1}_{c'}\boldsymbol{\theta}_{c'}$ includes not only $\Pi^{-1}_c\boldsymbol{\theta}_c$ but also some of $\Pi^{-1}_c\tilde{\boldsymbol{\theta}}_c$, where the active set $A\left(\tilde{\boldsymbol{\theta}}_c\right)$ is properly larger than $A\left(\boldsymbol{\theta}_c\right)$. This is a property of LARS [10] and the extended LARS [12].

Theorem 4

Given $\boldsymbol{\theta}^{opt}$, the active set $A\left(\boldsymbol{\theta}^{opt}_c\right)$ decreases monotonically as $c$ decreases, making $\boldsymbol{\theta}^{opt}_c$ sparser and sparser.

SECTION VI

SOLUTION PATH

We analyze the $\boldsymbol{\eta}$-projection $\boldsymbol{\eta}^{opt}_c$ as a function of $c$. It forms a solution path, connecting $\boldsymbol{\theta}^{opt}_c = 0$ for $c = 0$ with $\boldsymbol{\eta}^{opt}_c = \boldsymbol{\eta}^{opt}$ for sufficiently large $c$. When $\boldsymbol{\eta}^{opt}_c$ belongs to a face of $B_c$, that is, the active set $A$ is $N$, we have $\theta^{opt}_{c,i} \ne 0$ for all $i$. Hence, we have from (55)
$$\eta^{opt}_{\lambda,i} = -\lambda s_i, \tag{64}$$
where
$$s_i = \frac{\partial}{\partial\theta_i}F(\boldsymbol{\theta}) = \mathrm{sgn}\left(\theta_i\right). \tag{65}$$

When the projection falls in a lower-dimensional subface whose active set is $A$, $\boldsymbol{\eta}^{opt}_\lambda$ is equal to a subgradient, $-\lambda\nabla F\left(\boldsymbol{\theta}^{opt}_\lambda\right)$. We first analyze the solution path $\boldsymbol{\theta}^{opt}_\lambda$ or $\boldsymbol{\eta}^{opt}_\lambda$ in an interval where the active set $A$ does not change. We partition and rearrange the components of $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ according to whether their indices belong to $A$ or $\bar{A}$, as
$$\boldsymbol{\theta} = \left(\boldsymbol{\theta}^A, \boldsymbol{\theta}^{\bar{A}}\right), \quad \boldsymbol{\eta} = \left(\boldsymbol{\eta}^A, \boldsymbol{\eta}^{\bar{A}}\right), \tag{66}$$
where $\boldsymbol{\theta}^A$ is the subvector of $\boldsymbol{\theta}$ whose components are nonzero, with indices belonging to $A$. The other notations are understood similarly. We further define a signature vector $\boldsymbol{s}^A = \left(s^A_i\right)$, $i \in A$, by
$$s^A_i = \mathrm{sgn}\,\theta_i, \quad i \in A. \tag{67}$$
Then, we have the equations that determine $\boldsymbol{\theta}^{opt}_\lambda$ and $\boldsymbol{\eta}^{opt}_\lambda$ in terms of the partitioned $A$ and $\bar{A}$ components,
$$\boldsymbol{\eta}^{opt,A}_\lambda = -\lambda\boldsymbol{s}^A, \tag{68}$$
$$\boldsymbol{\theta}^{opt,\bar{A}}_\lambda = 0, \tag{69}$$
$$\sum_{i \in A}\left|\theta^A_i\right| = c. \tag{70}$$
The other components, $\boldsymbol{\theta}^{opt,A}_\lambda$ and $\boldsymbol{\eta}^{opt,\bar{A}}_\lambda$, are determined from the $\boldsymbol{\theta}$-$\boldsymbol{\eta}$ correspondence (25), (28). Since $\boldsymbol{\eta}^{opt}_\lambda$ is $-\lambda$ times a subgradient $\nabla F$, we note that
$$\left|\eta^{opt,\bar{A}}_{\lambda,i}\right| \le \lambda = \left|\eta^{opt,A}_{\lambda,i}\right|. \tag{71}$$

The solution path $\boldsymbol{\theta}^{opt}_\lambda$ ($\boldsymbol{\eta}^{opt}_\lambda$ in the $\boldsymbol{\eta}$-coordinates) is continuous and piecewise differentiable. It changes direction discontinuously when its active set alters, and is differentiable while $A$ does not change. We consider the path $\boldsymbol{\theta}^{opt}_\lambda$ inside a subface specified by $A$. By differentiating (68) and (69) with respect to $\lambda$, we have the following lemma.

Lemma 1

The solution path inside a fixed subface specified by $A$ satisfies
$$\dot{\boldsymbol{\theta}}^{opt,A}_\lambda = -G^{-1}_{AA}\boldsymbol{s}^A, \tag{72}$$
$$\dot{\boldsymbol{\theta}}^{opt,\bar{A}}_\lambda = 0, \tag{73}$$
where $\dot{\boldsymbol{\theta}}_\lambda = (d/d\lambda)\boldsymbol{\theta}_\lambda$ and $G_{AA}$ is the submatrix of $G$ corresponding to the indices in $A$.

Proof

Due to (26), small changes of $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are related by
$$d\boldsymbol{\eta} = G\,d\boldsymbol{\theta}. \tag{74}$$
Since $d\boldsymbol{\theta}^{opt,\bar{A}}_\lambda = 0$ on the path, the $A$-part of $d\boldsymbol{\eta}^{opt}_\lambda$ is
$$d\boldsymbol{\eta}^{opt,A}_\lambda = G_{AA}\,d\boldsymbol{\theta}^{opt,A}_\lambda, \tag{75}$$
or
$$d\boldsymbol{\theta}^{opt,A}_\lambda = \left(G_{AA}\right)^{-1}d\boldsymbol{\eta}^{opt,A}_\lambda. \tag{76}$$
Hence, by differentiating (68), we have (72), and by differentiating (69), we have (73). $\square$
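As an illustration of Lemma 1, a minimal Python sketch (our own notation, assuming the metric $G$ is available as a matrix, as in the quadratic case) that evaluates the path direction (72)-(73) for a given active set.

```python
import numpy as np

def path_direction(G, theta, active):
    """Direction d(theta)/d(lambda) of the solution path, eqs. (72)-(73).

    G      : (p, p) Hessian of phi (constant in the quadratic case)
    theta  : (p,) current point; only the signs on the active set are used
    active : indices of the active set A
    """
    d = np.zeros(G.shape[0])
    A = np.asarray(active)
    s_A = np.sign(theta[A])                          # signature vector (67)
    d[A] = -np.linalg.solve(G[np.ix_(A, A)], s_A)    # eq. (72)
    return d                                         # eq. (73): zero off A
```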

We study the angle between $\dot{\boldsymbol{\theta}}^{opt}_\lambda$ and the coordinate axis $\theta_i$, whose tangent vector is $\boldsymbol{e}_i$. The cosine of the angle is given by the inner product $\langle\dot{\boldsymbol{\theta}}^{opt}_\lambda, \boldsymbol{e}_i\rangle$ divided by $\|\dot{\boldsymbol{\theta}}^{opt}_\lambda\|\,\|\boldsymbol{e}_i\|$. We observe the following remarkable feature of the solution path inside a subface, which is a characteristic of LARS [10] and the extended LARS [12].

Theorem 5 (Least Equiangular Property)

The solution path $\boldsymbol{\theta}^{opt}_\lambda$ has the least equiangular property: $\dot{\boldsymbol{\theta}}^{opt}_\lambda$ makes the same angle with every coordinate axis $\theta_i$ belonging to $A$, and larger angles with the other coordinate axes $\theta_i$ $(i \in \bar{A})$.

Proof

The inner product of $\dot{\boldsymbol{\theta}}^{opt}_\lambda$ and $\boldsymbol{e}_i$ is given by
$$\langle\dot{\boldsymbol{\theta}}^{opt}_\lambda, \boldsymbol{e}_i\rangle = \left(G\dot{\boldsymbol{\theta}}^{opt}_\lambda\right)'\boldsymbol{e}_i = \dot{\boldsymbol{\eta}}^{opt\,\prime}_\lambda\boldsymbol{e}_i. \tag{77}$$
From
$$\dot{\eta}^{opt}_{\lambda,i} = -\lambda s^A_i, \quad \text{for } i \in A, \tag{78}$$
$$\dot{\eta}^{opt}_{\lambda,i} \in -\lambda[-1, 1], \quad \text{for } i \in \bar{A}, \tag{79}$$
we have
$$\left|\langle\dot{\boldsymbol{\theta}}^{opt}_\lambda, \boldsymbol{e}_i\rangle\right| = \lambda \ge \left|\langle\dot{\boldsymbol{\theta}}^{opt}_\lambda, \boldsymbol{e}_j\rangle\right|, \quad i \in A,\; j \in \bar{A}. \tag{80}$$
$\square$

The above theorem characterizes the solution path $\boldsymbol{\theta}^{opt}_\lambda$. See Fig. 8. The direction changes discontinuously when the active set $A$ is altered. This occurs when some $\theta_i = 0$ becomes $\theta_i \ne 0$, or some $\theta_i \ne 0$ becomes $\theta_i = 0$.

SECTION VII

MINKOVSKIAN GRADIENT METHOD

The solution path $\boldsymbol{\theta}^{opt}_\lambda$ is understood from the point of view of gradient descent of the function $\varphi(\boldsymbol{\theta})$ or $D^*\left[\boldsymbol{\theta}^{opt} : \boldsymbol{\theta}\right]$. This characterizes LARS, which starts at $\boldsymbol{\theta} = 0$ and approaches $\boldsymbol{\theta}^{opt}$ as $\lambda$ decreases. To this end, we introduce a new notion of the Minkovskian gradient.

Fig. 8. Equiangular property of the solution path.

We first define a generalized gradient in a manifold in which a Minkovskian norm is given (see, e.g., [19] for Minkowski and Finsler spaces). The ordinary gradient of $f(\boldsymbol{\theta})$ in a Euclidean space is
$$\boldsymbol{a} = \nabla f(\boldsymbol{\theta}) = \left(\frac{\partial}{\partial\theta_1}f, \ldots, \frac{\partial}{\partial\theta_p}f\right), \tag{81}$$
representing the steepest direction of $f(\boldsymbol{\theta})$ provided the coordinate system $\boldsymbol{\theta}$ is orthonormal. Otherwise, the steepest direction is given by the natural gradient $G^{-1}\nabla f$. We consider how $f(\boldsymbol{\theta})$ changes when $\boldsymbol{\theta}$ moves in direction $\boldsymbol{a}$ in a Minkovskian space, where the norm of $\boldsymbol{a}$ is given by the $L_q$-norm,
$$F_q(\boldsymbol{a}) = \frac{1}{q}\sum_i\left|a_i\right|^q, \quad q > 1. \tag{82}$$
The steepest direction $\boldsymbol{a}$ of $f(\boldsymbol{\theta})$ at $\boldsymbol{\theta}$ is defined by
$$\boldsymbol{a} = \lim_{\varepsilon\to 0}\arg\max\left|f(\boldsymbol{\theta} + \varepsilon\boldsymbol{a})\right| \tag{83}$$
under the condition
$$F_q(\boldsymbol{a}) = 1. \tag{84}$$
By the variational method, for obtaining the steepest direction $\boldsymbol{a}$ we have
$$\frac{\partial}{\partial\boldsymbol{a}}\left\{\nabla f(\boldsymbol{\theta})\cdot\boldsymbol{a} - \lambda F_q(\boldsymbol{a})\right\} = 0, \tag{85}$$
which gives
$$\frac{\partial}{\partial a_i}F_q(\boldsymbol{a}) \propto \frac{\partial}{\partial\theta_i}f(\boldsymbol{\theta}). \tag{86}$$
Hence,
$$a_i = \left[c\left|\frac{\partial}{\partial\theta_i}f(\boldsymbol{\theta})\right|\right]^{1/(q-1)}\mathrm{sgn}\left(\frac{\partial}{\partial\theta_i}f(\boldsymbol{\theta})\right), \tag{87}$$
where $c$ is a constant. In the Euclidean case of $q = 2$, this gives the ordinary gradient,
$$\boldsymbol{a} = \nabla f(\boldsymbol{\theta}). \tag{88}$$

We define the Minkovskian gradient in the $L_1$-norm case by taking the limit $q \to 1$. As $q \to 1$, by putting
$$c = \frac{1}{\max_i\left\{\left|\frac{\partial}{\partial\theta_i}f\right|\right\}}, \tag{89}$$
$a_i$ becomes 0 except for those $i$ for which the absolute value of $(\partial/\partial\theta_i)f$ is maximal. Hence, we have
$$a_i = \begin{cases} \mathrm{sgn}\left(\dfrac{\partial}{\partial\theta_i}f\right), & \text{for } i \text{ such that } \left|\dfrac{\partial}{\partial\theta_i}f\right| = \max_j\left\{\left|\dfrac{\partial}{\partial\theta_j}f\right|\right\}, \\[6pt] 0, & \text{otherwise}. \end{cases} \tag{90}$$
We call this the $L_1$-Minkovskian gradient, denoted by $\nabla^M_1 f(\boldsymbol{\theta})$. Note that $|a_i| = 1$ for the indices $i$ that attain the maximum of $\left|\partial f/\partial\theta_j\right|$ and $a_i = 0$ for the other $i$.
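A minimal sketch of the $L_1$-Minkovskian gradient (90); the tolerance used to detect ties among the maximal components is our own numerical device.

```python
import numpy as np

def minkovskian_gradient_l1(grad, tol=1e-10):
    """L1-Minkovskian gradient (90): sgn of the maximal-magnitude components
    of the ordinary gradient, and zero elsewhere."""
    grad = np.asarray(grad, dtype=float)
    mag = np.abs(grad)
    maximal = mag >= mag.max() - tol     # indices attaining the maximum
    a = np.zeros_like(grad)
    a[maximal] = np.sign(grad[maximal])
    return a
```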

We propose the Minkovskian steepest descent algorithm, starting at $\boldsymbol{\theta} = 0$, of tracing the minimum of $\varphi(\boldsymbol{\theta})$ as $\lambda$ decreases, or equivalently as $c$ increases. Let $\boldsymbol{\theta}^{(t)}$ and $\boldsymbol{\eta}^{(t)}$ be the current values in the respective coordinates. Then, the next point $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \Delta\boldsymbol{\theta}^{(t)}$ is given by the $L_1$-Minkovskian gradient,
$$\Delta\boldsymbol{\theta}^{(t)} = -\Delta t\,\nabla^M_1\varphi\left(\boldsymbol{\theta}^{(t)}\right), \tag{91}$$
where $\Delta t$ is an adequate step-size. $\boldsymbol{\eta}^{(t+1)}$ is obtained from $\boldsymbol{\theta}^{(t+1)}$. The following algorithm is the same as LARS when $\varphi(\boldsymbol{\theta})$ is a quadratic function. It is the reverse of the path pursuit of Hirose and Komaki [12], which starts at $\boldsymbol{\theta}^{opt}$.

A. Minkovskian Gradient Algorithm

1) Start

Begin with $\boldsymbol{\theta}^{(0)} = 0$, $\boldsymbol{\eta}^{(0)} = \nabla\varphi\left(\boldsymbol{\theta}^{(0)}\right)$.

2) Active Set Formula$A$

For $t = 0, 1, 2, \ldots$, calculate $\boldsymbol{\eta}^{(t)} = \nabla\varphi\left(\boldsymbol{\theta}^{(t)}\right)$ and determine the maximum of $\left|\eta^{(t)}_1\right|, \ldots, \left|\eta^{(t)}_p\right|$. Let
$$\max_i\left|\eta^{(t)}_i\right| = \left|\eta^{(t)}_{i^*_1}\right| = \cdots = \left|\eta^{(t)}_{i^*_k}\right|. \tag{92}$$
Then, the active set is determined by
$$A^{(t)} = \left\{i^*_1, \ldots, i^*_k\right\}. \tag{93}$$
The $L_1$-Minkovskian gradient is
$$\nabla^M_1\varphi\left(\boldsymbol{\theta}^{(t)}\right)_i = \begin{cases} \mathrm{sgn}\left(\theta^{(t)}_i\right), & i \in A^{(t)}, \\ 0, & i \in \bar{A}^{(t)}. \end{cases} \tag{94}$$

3) Calculation of Solution Path

Solve the equation of the solution path stepwise, changing the current $\boldsymbol{\theta}^{(t)}$ to
$$\boldsymbol{\theta}^{(t)} \to \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \Delta\boldsymbol{\theta}^{(t)}, \tag{95}$$
by
$$\Delta\boldsymbol{\theta}^{(t)A} = G^{-1}_{AA}\left(\boldsymbol{\theta}^{(t)}\right)\boldsymbol{s}^A\,\Delta t, \tag{96}$$
$$\Delta\boldsymbol{\theta}^{(t)\bar{A}} = 0. \tag{97}$$
This can be rewritten in terms of the $\boldsymbol{\eta}$-coordinates as
$$\boldsymbol{\eta}^{(t+1)A} = \boldsymbol{\eta}^{(t)A} + \boldsymbol{s}^A\,\Delta t, \tag{98}$$
$$\boldsymbol{\eta}^{(t+1)\bar{A}} = \boldsymbol{\eta}^{(t)\bar{A}} + G_{\bar{A}A}\,\Delta\boldsymbol{\theta}^{(t)A}. \tag{99}$$

4) Check of Turning Point

Check whether $\left|\eta^{(t+1)}_i\right|$, $i \in \bar{A}$, becomes equal to $\left|\eta^{(t+1)}_j\right|$, $j \in A$. If this occurs for some $j^* \in \bar{A}$, add it to the active set, forming a new active set
$$A^{(t+1)} = \left\{i^*_1, \ldots, i^*_k, j^*\right\}. \tag{100}$$
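The following minimal Python sketch puts steps 1)-4) together for the quadratic case of Section II-A, where $\nabla\varphi(\boldsymbol{\theta}) = G\boldsymbol{\theta} - X'\boldsymbol{y}$. The fixed step size, the tie tolerance, and the re-detection of the active set at every iteration (in place of the explicit turning-point check) are our own simplifications, so this is a sketch of the idea rather than the paper's exact procedure.

```python
import numpy as np

def minkovskian_descent(X, y, step=1e-3, max_iter=20000, tol=1e-6):
    """Minkovskian gradient algorithm for phi(theta) = 0.5 |y - X theta|^2."""
    n, p = X.shape
    G, Xy = X.T @ X, X.T @ y
    theta = np.zeros(p)                                  # step 1): start at theta = 0
    for _ in range(max_iter):
        eta = G @ theta - Xy                             # gradient of phi
        lam = np.max(np.abs(eta))
        if lam < tol:                                    # unconstrained optimum reached
            break
        A = np.flatnonzero(np.abs(eta) >= lam - 1e-12)   # step 2): maximal components
        s_A = -np.sign(eta[A])                           # sgn(theta_i) on the path, cf. (64)
        d = np.zeros(p)
        d[A] = np.linalg.solve(G[np.ix_(A, A)], s_A)     # step 3), eqs. (96)-(97)
        theta = theta + step * d
    return theta
```

With a sufficiently small step size, the iterates trace the solution path from the sparsest solution $\boldsymbol{\theta} = 0$ toward $\boldsymbol{\theta}^{opt}$, so stopping early yields sparse solutions corresponding to intermediate values of $c$.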

We now compare our Minkovskian gradient method with the original LARS [4], [10] and the extended LARS of Hirose and Komaki [12]. As can easily be seen, our algorithm is exactly the same as LARS and is applicable to a general convex $\varphi(\boldsymbol{\theta})$; therefore, it is an extended LARS. When $\varphi(\boldsymbol{\theta})$ is a quadratic function, $M$ is Euclidean, so that an $\boldsymbol{\eta}$-geodesic is a $\boldsymbol{\theta}$-geodesic at the same time. Hence, we do not need stepwise calculations of $\boldsymbol{\theta}^{(t)}$, $\boldsymbol{\eta}^{(t)}$ while $A$ is fixed. We can use a large $\Delta t$ until the next $\boldsymbol{\eta}^{(t+1)}$ is a turning point of the solution path where $A$ changes. This is a merit of the LARS procedure when $\varphi$ is quadratic.

The extended LARS [12] begins with $\boldsymbol{\theta}^{opt}$ and traces the solution path (96), (97) in the opposite direction by changing the sign of (96). This direction is equiangular and is hence the opposite of the Minkovskian gradient. When some $\theta_i$, say $\theta_{i^*}$, becomes 0, the index $i^*$ is removed from the active set. The index $i^*$ is determined by using the Pythagorean theorem of the $\boldsymbol{\eta}$-projection. The procedure continues until $\boldsymbol{\theta}$ reaches the origin, $\boldsymbol{\theta} = 0$. Therefore, its solution path traces from $\boldsymbol{\theta}^{opt}$ to 0, which is exactly the same as ours except that ours traces from 0 to $\boldsymbol{\theta}^{opt}$. It should be remarked that the algorithm of [12] cannot be applied to the under-determined case, where no unique $\boldsymbol{\theta}^{opt}$ exists. Our extended LARS works perfectly even in this case, as will be shown in the next section. We summarize this in the following theorem.

Theorem 6

The extended LARS is a gradient descent method based on the $L_1$-Minkovskian gradient.

SECTION VIII

MINKOVSKIAN GRADIENT METHOD IN UNDER-DETERMINED CASE

We have so far assumed that $\varphi(\boldsymbol{\theta})$ is strictly convex. This implies that there exists a unique $\boldsymbol{\theta}^{opt}$ satisfying
$$\boldsymbol{\eta}^{opt} = \nabla\varphi\left(\boldsymbol{\theta}^{opt}\right) = 0 \tag{101}$$
and that $G = \nabla\nabla\varphi$ is a full-rank positive-definite matrix. However, the Minkovskian gradient method works even in the under-determined case. This is because the optimality condition (49) is the same, and the least equiangular property (Theorem 5) holds in the same way. Note that $\varphi(\boldsymbol{\theta})$ is not strictly convex in the under-determined case, so that the optimal solution of
$$\nabla\varphi(\boldsymbol{\theta}) = 0 \tag{102}$$
is not unique but forms a $k$-dimensional submanifold. In the regression case, $k = p - n$. We need to assume that any $m \times m$ principal submatrix of $G$ is non-degenerate when $m \le k$, which holds generically.

We show that the Minkovskian gradient algorithm works in the under-determined case. We have $\boldsymbol{\eta} = \nabla\varphi(\boldsymbol{\theta})$, but we cannot recover $\boldsymbol{\theta}$ from $\boldsymbol{\eta}$ uniquely in the under-determined case. By differentiating the Lagrangean problem (3), we see that the optimal solution satisfies (49). We further differentiate it with respect to $\lambda$, obtaining
$$G_{AA}\dot{\boldsymbol{\theta}}^{opt,A}_\lambda = -\boldsymbol{s}_A, \tag{103}$$
$$\dot{\boldsymbol{\theta}}^{opt,\bar{A}}_\lambda = 0, \tag{104}$$
while the active set $A$ is fixed. When $|A| \le k$, (96) and (97), or equivalently (98) and (99), hold, since $G_{AA}$ is of full rank. Therefore, the Minkovskian gradient method works well until the optimal solution is obtained, whose number of non-zero components of $\boldsymbol{\theta}^{(t)}$ is at most $k$.

The algorithm proceeds in terms of $\boldsymbol{\eta}^{(t)}_\lambda$ and $A^{(t)}$, but it is possible to obtain $\boldsymbol{\theta}^{(t)}$ from $\boldsymbol{\eta}^{(t)}_\lambda$ and $A^{(t)}$ by solving
$$\boldsymbol{\eta}^{(t)}_\lambda = \nabla\varphi\left(\boldsymbol{\theta}^{(t)}_\lambda\right), \tag{105}$$
$$\boldsymbol{\theta}^{(t)\bar{A}}_\lambda = 0. \tag{106}$$
The equations are solvable when any $m \times m$ principal submatrix of $G$ $(m \le k)$ is non-degenerate.

SECTION IX

DISCUSSIONS

We have searched for the solution path $\boldsymbol{\theta}^{opt}_\lambda$ as $\lambda$ changes, but did not consider a consistent estimator, which might be obtained by choosing $\lambda$ adequately. In order to discuss an efficient consistent estimator, the oracle properties, that is, identifying the true active set $A$ without losing the optimal convergence rate as $n$ tends to infinity, were proposed in [20]. It is known that the LASSO does not satisfy the oracle properties, but the adaptive LASSO proposed in [21] satisfies them. The adaptive LASSO uses the weighted $L_1$ constraint
$$F(\boldsymbol{\theta}, \boldsymbol{w}) = \sum_i w_i\left|\theta_i\right|, \tag{107}$$
and modifies $\boldsymbol{w}$ adaptively in a data-dependent way,
$$\hat{w}_i = \left|\hat{\theta}_i\right|^{-\gamma}, \quad \gamma > 0, \tag{108}$$
where $\hat{\boldsymbol{\theta}}$ is a consistent estimator.

The present paper does not search for consistent estimators but studies the solution path $\boldsymbol{\theta}^{opt}_\lambda$, which includes sparse solutions of various degrees of sparsity. However, it is interesting to study the geometry of the weighted $L_1$ constraint problem. When $\boldsymbol{w}$ is fixed, let $W$ be the diagonal matrix with diagonal entries $w_i$. We rescale $\boldsymbol{\theta}$ and $\boldsymbol{x}_a$ by
$$\tilde{\boldsymbol{\theta}} = W\boldsymbol{\theta}, \tag{109}$$
$$\tilde{\boldsymbol{x}}_a = \boldsymbol{x}_a W^{-1}. \tag{110}$$
Then, the constraint is $L_1$ in terms of $\tilde{\boldsymbol{\theta}}$,
$$\tilde{F}\left(\tilde{\boldsymbol{\theta}}\right) = \sum_i\left|\tilde{\theta}_i\right|, \tag{111}$$
where the target function is
$$\tilde{\varphi}\left(\tilde{\boldsymbol{\theta}}\right) = \varphi\left(W^{-1}\tilde{\boldsymbol{\theta}}\right), \tag{112}$$
in particular
$$\tilde{\varphi}\left(\tilde{\boldsymbol{\theta}}\right) = \frac{1}{2}\left|\boldsymbol{y} - XW^{-1}\tilde{\boldsymbol{\theta}}\right|^2 \tag{113}$$
in the quadratic case.
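A minimal sketch of the rescaling (109)-(113) for the quadratic case: the weighted problem is mapped to an ordinary $L_1$ problem in the tilde coordinates and then mapped back by $W^{-1}$. The `solver` argument stands for any $L_1$-path solver (for instance the `minkovskian_descent` sketch above); the function itself is our own illustration.

```python
import numpy as np

def solve_weighted_l1(X, y, w, solver):
    """Weighted L1 problem (107) via the rescaling (109)-(110)."""
    W_inv = np.diag(1.0 / np.asarray(w, dtype=float))
    X_tilde = X @ W_inv                  # eq. (110): rows are x_a W^{-1}
    theta_tilde = solver(X_tilde, y)     # ordinary L1 problem in tilde coordinates
    return W_inv @ theta_tilde           # map back: theta = W^{-1} theta_tilde
```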

The Riemannian metric changes to
$$\tilde{G} = \tilde{\nabla}\tilde{\nabla}\tilde{\varphi}\left(\tilde{\boldsymbol{\theta}}\right) = W^{-1}GW^{-1}, \tag{114}$$
where $\tilde{\nabla} = \left(\partial/\partial\tilde{\theta}_i\right)$. Hence, the problem is formulated in the same way in $\tilde{\boldsymbol{\theta}}$, but the geometry changes from $G$ to $\tilde{G}$. The adaptive LASSO implies an adaptive selection of the geometry.

Instead of the adaptive $\hat{\boldsymbol{w}}$ in (108), we may consider the case where the weight vector is given by
$$w_i = \left|\theta_i\right|^{-\gamma}, \quad 0 < \gamma < 1. \tag{115}$$
The constraint is then given by
$$F(\boldsymbol{\theta}) = \sum_i\left|\theta_i\right|^{1-\gamma}, \tag{116}$$
so that the constraint is $L_p$ with $p = 1 - \gamma$. The problem is a non-convex optimization in this case. It is interesting to compare the adaptive LASSO, which is solved in the framework of convex optimization, with the non-convex $L_p$ problem, $0 < p < 1$. Information geometry is useful for studying the problem in this case, too.

SECTION X

CONCLUSIONS

We have elucidated the information-geometrical properties of convex optimization problems under the $L_1$-norm constraint, following the ideas of Hirose and Komaki [12] in more detail. The two dually coupled affine coordinate systems play a fundamental role, although the parameter space is not Euclidean but Riemannian, except for the case of optimization of a quadratic function. We proposed the Minkovskian gradient method, which reverses the procedure proposed in [12]. This is an extension of the original LARS. It is numerically robust because it is a gradient descent method. The method can be applied even to the under-determined case, where the target function is not strictly convex. It is interesting to generalize the current approach to the $L_{1/2}$-regularization problem; see Xu, Chang, Xu and Zhang [22], [23]; Yukawa and Amari [24].

Footnotes

The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Shiro Ikeda.

S.-I. Amari is with the RIKEN Brain Science Institute, Saitama 351-0198, Japan (e-mail: amari@brain.riken.jp).

M. Yukawa is with the Department of Electrical and Electronic Engineering, Niigata University, Niigata 950-2181, Japan (e-mail: yukawa@eng.niigata-u.ac.jp).


Authors

Shun-ichi Amari

Shun-ichi Amari (M'71–SM'92–F'94–LF'06) was graduated from the University of Tokyo in 1958, majoring in mathematical engineering, and received the Dr.Eng. degree from the University of Tokyo in 1963. He was an Associate Professor at Kyushu University, an Associate and then Full Professor at the Department of Mathematical Engineering and Information Physics, University of Tokyo, and is now Professor-Emeritus at the University of Tokyo. He is the Director of RIKEN Brain Science Institute, Saitama, Japan. He has been engaged in research in wide areas of mathematical engineering and applied mathematics, such as topological network theory, differential geometry of continuum mechanics, pattern recognition, mathematical foundations of neural networks, machine learning and information geometry.

Dr. Amari served as President of the International Neural Network Society, Council member of Bernoulli Society for Mathematical Statistics and Probability Theory, and President of the Institute of Electrical, Information and Communication Engineers. He was founding Coeditor-in-Chief of Neural Networks. He has been awarded the Japan Academy Award, IEEE Neural Networks Pioneer Award, IEEE Emanuel R. Piore Award, C&C Award, Neurocomputing best paper award, and IEEE Signal Processing Society best paper award, among many others.

Masahiro Yukawa

Masahiro Yukawa (M'06) received the B.E., M.E., and Ph.D. degrees from Tokyo Institute of Technology in 2002, 2004, and 2006, respectively. From October 2006 to March 2007, he was a Visiting Researcher at the Department of Electronics, the University of York, U.K. From April 2007 to March 2008, he was with the Next Generation Mobile Communications Laboratory at RIKEN, Saitama, Japan, and, from April 2008 to March 2010, he was with the Brain Science Institute at RIKEN. From August to November 2008, he was a Guest Researcher at the Associate Institute for Signal Processing, the Technical University of Munich, Germany. He is currently an Associate Professor at Niigata University, Japan. His current research interests are in mathematical signal processing, nonlinear adaptive filtering, and sparse signal processing.

He is an Associate Editor of IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences. From April 2005 to March 2007, he was a recipient of the Research Fellowship of the Japan Society for the Promotion of Science (JSPS). He received the Excellent Paper Award and the Young Researcher Award from the IEICE in 2006 and in 2010, respectively, the Yasujiro Niwa Outstanding Paper Award from Tokyo Denki University in 2007, and the Ericsson Young Scientist Award from Nippon Ericsson in 2009. He is a member of the Institute of Electrical, Information and Communication Engineers (IEICE) of Japan.
