**I.** Introduction
**II.** Optimization of Convex Function Under $L_{1}$ Constraint
**III.** Information Geometry of Convex Optimization
**IV.** Geometry of $L_{1}$ Optimization
**V.** Inverse Projection
**VI.** Solution Path
**VII.** Minkovskian Gradient Method
**VIII.** Minkovskian Gradient Method in Under-Determined Case
**IX.** Discussions
**X.** Conclusions

SECTION I

A sparse solution is obtained by minimizing a convex function under a constraint on the $L_{1}$ norm. A typical case is linear regression, where the target function is quadratic. When the true solution is sparse, it is often recovered properly even when the number of observations is small. This has been extensively studied under the name of compressed sensing; see, for example, Chen, Donoho and Saunders [1]; Candes and Wakin [2]; Donoho and Tsaig [3]; Bruckstein, Donoho and Elad [4]; Candes, Romberg and Tao [5]; Candes and Tao [6]; Elad [7]; Eldar and Kutyniok [8]; Donoho [9]; and many others.

There are a number of algorithms to obtain a sparse solution efficiently in the linear regression problem, such as LARS (Efron, Hastie, Johnstone and Tibshirani [10]), LASSO (Tibshirani [11]) and their variants. They can be extended from a quadratic cost function to a general convex function (Hirose and Komaki [12]; Friedman, Hastie and Tibshirani [13]). Since a convex function endows a dually flat Riemannian geometrical structure to the manifold of the parameter space (Amari and Nagaoka [14]; Amari and Cichocki [15]), information geometry is useful for solving the convex optimization problem. Hirose and Komaki [12] gave an extended LARS algorithm to be applicable to this problem, where the dually flat geometrical structure plays a fundamental role. See also Yukawa and Amari [16].

The present paper intends to elucidate the geometrical properties of the solution path of a convex optimization problem under the parametric constraint that the $L_{1}$ norm is limited to $c$, or equivalently of the Lagrangian problem with Lagrange multiplier $\lambda$, where $c$ or $\lambda$ changes continuously. A main result is that a new version of the extended LARS is a steepest descent algorithm under the Minkovskian gradient, that is, the steepest direction of a function under a Minkovskian norm, which is the $L_{1}$ norm in the present case. Since the target function is convex, the gradient method is robust in the sense that numerical calculation errors are automatically corrected. We show that our Minkovskian gradient method is also applicable to the under-determined case of compressed sensing, while the Hirose-Komaki algorithm [12] is applicable only to the over-determined case.

The present paper is organized as follows. After this Introduction, Section II presents a constrained optimization problem and its Lagrangian formulation, together with a few typical examples. Section III is devoted to an explanation of the information geometry derived from a strictly convex function. Here, the Riemannian metric and dually coupled flat affine connections are introduced. They define two types of geodesics. We further explain a generalized Pythagorean theorem and a projection theorem. In the particular case when the convex function is quadratic, the manifold is Euclidean and the two types of geodesics are identical. But the dually coupled affine coordinates play an important role even in this case.

In Section IV, we show that the constrained optimal solution is given by the dual geodesic projection of the unconstrained optimal solution to the $L_{1}$ constraint set, which is convex in the primal coordinates. Section V explains that the inverse projection gives a partition of the manifold, and its monotonic property is proved. Section VI gives the equation of a solution path, from which the least equi-angle property of LARS is proved. Section VII introduces a Minkovskian gradient of a function, and proves that the extended LARS is a Minkovskian gradient descent method. We give an algorithm that obtains a solution path starting from the sparsest solution, 0, and adding non-zero components one by one, like LARS. However, our purpose is to elucidate geometrical properties of the problem rather than to propose an efficient numerical algorithm. Hence we do not show any numerical examples. See Hirose and Komaki [12], for example, to see how the extended LARS works. Elaborating the algorithmic details of the Minkovskian gradient method is left for future work. Section VIII shows that the Minkovskian gradient method works even for the under-determined problem, where the target function is convex but not strictly convex. Section X states conclusions.

SECTION II

We study the problem of minimizing a convex function $\varphi({\mmb{\theta } })$, ${\mmb{\theta } }= \left (\theta _{1},\ldots, \theta_{p} \right)\in {\mbi{R} }^{p}$, under a sparsity constraint. We use the $L_{1}$ constraint, where the $L_{1}$ norm of ${\mmb{\theta } }$ is bounded by a constant $c$, $$F (\mmb{\theta }) = \sum\limits _{i=1}^{p} \left \vert \theta _{i} \right \vert \leq c.\eqno{\hbox{(1)}}$$ Then, the problem is formulated as $${\rm Problem}~ P_{c} : {\rm Minimize} ~ \varphi (\mmb{\theta })~ {\rm under} ~F(\mmb{\theta }) \leq c.\eqno{\hbox{(2)}}$$ We can solve it by the Lagrange method, which is formulated as $${\rm Problem} ~ P_{\lambda } : {\rm Minimize} ~ \varphi (\mmb{\theta }) +\lambda F (\mmb{\theta }),\eqno{\hbox{(3)}}$$ where $\lambda$ is a Lagrange multiplier. We state three well-known examples.

Given $n$ design vectors of $p$ dimensions, $${\mbi{x} }_{a} = \left (x_{a1},\ldots, x_{ap}\right), \quad a=1,\ldots, n,\eqno{\hbox{(4)}}$$ $n$ responses $$y_{a} = \sum\limits ^{p}_{i=1} x_{ai} \theta _{i} + \varepsilon _{a}, \quad a=1,\ldots, n,\eqno{\hbox{(5)}}$$ are observed, where ${\mmb{\theta } }= \left (\theta _{1},\ldots, \theta _{p}\right)$ are parameters to be estimated and $\varepsilon _{a}$ are independent zero-mean Gaussian noise terms distributed as $N \left (0,\sigma ^{2}\right)$.

The maximum likelihood estimator is obtained by minimizing the sum of squared errors $$\varphi ({\mmb{\theta } }) = {{1}\over {2}} \sum\limits _{a=1}^{n} \left (y_{a} - {\mmb{\theta } } \cdot{\mbi{x} }_{a} \right)^{2} = {{1}\over {2}} \left \vert{\mbi{y} } - X {\mmb{\theta } }\right \vert ^{2},\eqno{\hbox{(6)}}$$ where $${\mbi{y} } = \left (\matrix{y_{1} \cr \vdots \cr y_{n}} \right), \quad X = \left (\matrix{{\mbi{x} }_{1} \cr \vdots \cr {\mbi{x} }_{n}} \right), \quad{\mmb{\theta } } = \left (\matrix{\theta _{1} \cr \vdots \cr \theta _{p}} \right).\eqno{\hbox{(7)}}$$ This is a quadratic function rewritten as $$\varphi ({\mmb{\theta } }) = {{1}\over {2}} {\mmb{\theta } }^{\prime}G {\mmb{\theta } }- {\mbi{y} }^{\prime} X {\mmb{\theta } } + {{1}\over {2}}{\mbi{y} }^{\prime} {\mbi{y} },\eqno{\hbox{(8)}}$$ where $$G = X^{\prime} X,\eqno{\hbox{(9)}}$$ “${}^{\prime}{}$” denoting transposition. When the number $n$ of observations is larger than the number $p$ of parameters, the problem is over-determined in general and $\varphi (\mmb{\theta })$ is a strictly convex function. In this case, $G$ is a positive-definite matrix and we have a unique minimizer ${\mmb{\theta } }^{opt}$ of $\varphi$ which satisfies $$\nabla \varphi \left ({\mmb{\theta } }^{opt}\right) = 0,\eqno{\hbox{(10)}}$$ where $\nabla$ is the gradient operator, $\nabla = \left ({{\partial }/ {\partial \theta _{i}}}\right)$. Since the gradient of (8) is $G{\mmb{\theta } } - X^{\prime}{\mbi{y} }$, the minimizer is given by $${\mmb{\theta } }^{opt} = G^{-1}X^{\prime}{\mbi{y} }.\eqno{\hbox{(11)}}$$
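The over-determined quadratic case can be illustrated with a minimal NumPy sketch. The data, the dimensions `n = 50`, `p = 3`, and the ground-truth vector `theta_true` are hypothetical choices made only for this illustration; the sketch solves the normal equations $G{\mmb{\theta } } = X^{\prime}{\mbi{y} }$ and evaluates the gradient of (8) at the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                # over-determined: n > p (assumed sizes)
X = rng.normal(size=(n, p))                 # rows are the design vectors x_a
theta_true = np.array([1.5, 0.0, -2.0])     # hypothetical ground truth
y = X @ theta_true + 0.1 * rng.normal(size=n)

G = X.T @ X                                 # Gram matrix, Eq. (9)
theta_opt = np.linalg.solve(G, X.T @ y)     # normal equations G theta = X' y


def phi(theta):
    """Sum of squared errors, Eq. (6)."""
    r = y - X @ theta
    return 0.5 * r @ r


# Gradient of the quadratic form (8); it should vanish at theta_opt, Eq. (10).
grad_at_opt = G @ theta_opt - X.T @ y
```

Solving the linear system with `np.linalg.solve` rather than forming $G^{-1}$ explicitly is a common numerical choice and does not change the result.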

When $n< p$, the problem is under-determined. In this case, $G$ is positive-semidefinite and its rank degenerates. The minimizer of $\varphi$ is not unique. The minimizers of $\varphi$ form an affine subspace. We need to use a sparsity constraint for obtaining a unique sparse solution. We study the over-determined case first, but our Minkovskian gradient method is applicable to the under-determined case as well.

Consider the case where the noise $\varepsilon _{a}$ is not Gaussian and its probability density function is $$p(\varepsilon) = \kappa \exp \left \{-\psi (\varepsilon)\right \},\eqno{\hbox{(12)}}$$ where $\psi (\varepsilon)$ is a convex function and $\kappa > 0$ is a constant. Typically, $$\psi (\varepsilon) = \vert \varepsilon \vert ^{k},\eqno{\hbox{(13)}}$$ which gives Laplace noise for $k=1$. The linear regression problem is to minimize $$\varphi ({\mmb{\theta } }) = \sum\limits _{a} \psi\left (y_{a} -{\mbi{x} }_{a} \cdot {\mmb{\theta } }\right).\eqno{\hbox{(14)}}$$ The target function is a convex function of ${\mmb{\theta } }$ but is not quadratic except for the Gaussian case of $k=2$.

A logistic regression problem has binary responses, $y_{a}=0, 1$, $a=1, \ldots, n$, and its probability is given by the logistic curve, $${\rm Prob} \left \{y_{a}=y\right \} = \exp\left \{\xi _{a} y-\psi (\xi _{a})\right \}, \quad y=0, 1,\eqno{\hbox{(15)}}$$ where $\psi (\xi)$ is the normalization term given by $$\psi (\xi) = \log \left \{1+ \exp (\xi)\right \}.\eqno{\hbox{(16)}}$$ The parameter $\xi _{a}$ for $y_{a}$ is given by $$\xi _{a} = {\mbi{x} }_{a} \cdot {\mmb{\theta } }.\eqno{\hbox{(17)}}$$ The loss function is the negative of the sum of the $\log$ probabilities (15), $$\varphi (\mmb{\theta }) = -\sum\limits _{a} y_{a} {\mbi{x} }_{a} \cdot {\mmb{\theta } }+ \sum\limits _{a} \psi \left ({\mbi{x} }_{a} \cdot {\mmb{\theta } }\right),\eqno{\hbox{(18)}}$$ which is a convex function (strictly convex when $n \geq p$). The optimal solution is given by the solution of $$\sum\limits _{a} \left \{y_{a}- {{\exp \left ({\mbi{x} }_{a} \cdot {\mmb{\theta } }\right)}\over{1+ \exp\left ({\mbi{x} }_{a} \cdot {\mmb{\theta } }\right)}}\right \} \cdot{\mbi{x} }_{a} = 0.\eqno{\hbox{(19)}}$$
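The relation between the loss (18) and the stationarity condition (19) can be checked numerically. The following is a minimal sketch on random, hypothetical data; the function names `psi`, `phi` and `grad_phi` are chosen here for illustration. It compares the analytic gradient, which is the negative of the left-hand side of (19), with a central finite difference of (18).

```python
import numpy as np


def psi(xi):
    """Normalization term log(1 + exp(xi)) of Eq. (16), computed stably."""
    return np.logaddexp(0.0, xi)


def phi(theta, X, y):
    """Negative log-likelihood of Eq. (18)."""
    xi = X @ theta
    return -(y @ xi) + psi(xi).sum()


def grad_phi(theta, X, y):
    """Gradient of (18); setting it to zero gives Eq. (19)."""
    xi = X @ theta
    prob = 1.0 / (1.0 + np.exp(-xi))        # equals exp(xi) / (1 + exp(xi))
    return -X.T @ (y - prob)


rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                # hypothetical design matrix
y = (rng.random(20) < 0.5).astype(float)    # hypothetical binary responses
theta = rng.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (phi(theta + eps * e, X, y) - phi(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
```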

SECTION III

A strictly convex smooth function induces a geometrical structure in a manifold $M$. See information geometry [14]. It becomes a Riemannian manifold, where a Riemannian metric tensor $G=\left (g_{ij}\right)$ is defined by the Hessian of $\varphi$, $$G({\mmb{\theta } }) = \nabla \nabla \varphi ({\mmb{\theta } }).\eqno{\hbox{(20)}}$$ The squared length of a small line element $d{\mmb{\theta } }$ is given by the quadratic form $$ds^{2} = \langle d {\mmb{\theta } }, d{\mmb{\theta } } \rangle= d {\mmb{\theta } }^{\prime}G({\mmb{\theta } })d{\mmb{\theta } } = \sum\limits _{i,j} g_{ij} d \theta _{i} d \theta_{j}.\eqno{\hbox{(21)}}$$ When there are two small line elements $d_{1} {\mmb{\theta } }$ and $d_{2}{\mmb{\theta } }$, their inner product at ${\mmb{\theta } }$ is given by $$\langle d_{1} {\mmb{\theta } }, d_{2} {\mmb{\theta } } \rangle =d_{1} {\mmb{\theta } }^{\prime} G ({\mmb{\theta } }) d_{2} {\mmb{\theta } },\eqno{\hbox{(22)}}$$ and they are orthogonal when it vanishes.

A divergence function $D[P:Q]$, called the Bregman divergence, is introduced between two points $P$ and $Q$ in $M$ based on $\varphi ({\mmb{\theta } })$. It is defined as $$D [P:Q] = \varphi \left ({\mmb{\theta } }_{P} \right)-\varphi \left ({\mmb{\theta } }_{Q} \right)- \nabla \varphi\left ({\mmb{\theta } }_{Q} \right) \cdot \left ({\mmb{\theta } }_{P}-{\mmb{\theta } }_{Q} \right),\eqno{\hbox{(23)}}$$ where ${\mmb{\theta } }_{P}$ and ${\mmb{\theta } }_{Q}$ are the coordinates of $P$ and $Q$, respectively. See Fig. 1. The divergence is non-negative, and is equal to 0 when and only when $P=Q$. It is not a distance, because it is not symmetric: $D[P:Q] = D[Q:P]$ does not hold in general. When $Q$ is infinitesimally close to $P$, $Q=P+dP$, let ${\mmb{\theta } }$ and ${\mmb{\theta } }+ d{\mmb{\theta } }$ be their coordinates. Then, the Taylor expansion shows that their divergence is related to the Riemannian metric, $$D[P: P+dP] = {{1}\over {2}} \sum\limits _{i,j} g_{ij} d \theta _{i} d \theta _{j}.\eqno{\hbox{(24)}}$$
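The properties of the Bregman divergence stated above (non-negativity, asymmetry, and the local quadratic form (24)) can be illustrated with a minimal sketch. The convex function $\varphi ({\mmb{\theta } }) = \sum _{i} \exp (\theta _{i})$ and the points `P`, `Q` are hypothetical choices made for this illustration only; they do not come from the paper.

```python
import numpy as np


def bregman(phi, grad_phi, theta_p, theta_q):
    """Bregman divergence D[P:Q] of Eq. (23)."""
    return phi(theta_p) - phi(theta_q) - grad_phi(theta_q) @ (theta_p - theta_q)


# Hypothetical strictly convex example: phi(theta) = sum_i exp(theta_i).
phi = lambda t: np.exp(t).sum()
grad_phi = lambda t: np.exp(t)
hess_phi = lambda t: np.diag(np.exp(t))     # Riemannian metric G(theta), Eq. (20)

P = np.array([0.2, -0.4])
Q = np.array([1.0, 0.3])
d_pq = bregman(phi, grad_phi, P, Q)
d_qp = bregman(phi, grad_phi, Q, P)         # differs from d_pq in general

# Local behaviour, Eq. (24): D[P : P + dP] is close to (1/2) dP' G(P) dP.
d = 1e-4 * np.array([1.0, -2.0])
local = bregman(phi, grad_phi, P, P + d)
quad = 0.5 * d @ hess_phi(P) @ d
```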

When an affine connection is introduced in $M$, it defines “straightness” of a curve. A straight curve is called a geodesic. Two dually coupled affine connections and related covariant derivatives are naturally introduced in $M$ by using a divergence function $D$ [14], [15]. Hence, two types of geodesics are defined by the two affine connections. We do not go into the details of differential geometry here. We state only that a manifold $M$ equipped with a convex function $\varphi ({\mmb{\theta } })$ has two affine connections, both of which are flat. Since they are flat, we have two special affine coordinate systems. They are affine in the respective senses, and the geodesics are given as linear curves in the respective coordinate systems.

One affine coordinate system is ${\mmb{\theta } }$ itself, in terms of which the convex function $\varphi$ is defined. The other is its Legendre transform $${\mmb{\eta } } = \nabla \varphi ({\mmb{\theta } }),\eqno{\hbox{(25)}}$$ where ${\mmb{\theta } }$ and ${\mmb{\eta } }$ are in one-to-one correspondence. It is easy to see from (20) that the Riemannian metric is given by $$g_{ij}({\mmb{\theta } }) = {{\partial \eta _{i} ({\mmb{\theta } })}\over{\partial\theta _{j}}}.\eqno{\hbox{(26)}}$$

The Legendre duality guarantees the existence of another convex function of ${\mmb{\eta } }$, defined by $$\varphi ^{\ast }({\mmb{\eta } }) = \mathop {\max }_{\mmb{\theta } }\left \{{\mmb{\theta } } \cdot {\mmb{\eta } } - \varphi ({\mmb{\theta } })\right \}.\eqno{\hbox{(27)}}$$ The ${\mmb{\theta } }$ coordinates are recovered by its gradient, $${\mmb{\theta } } = \nabla \varphi ^{\ast }({\mmb{\eta } }).\eqno{\hbox{(28)}}$$ Another Bregman divergence $D^{\ast }$ is defined in the coordinates ${\mmb{\eta } }$ by using the dual convex function $\varphi ^{\ast }$ as $$D^{\ast }[P:Q] = \varphi ^{\ast } \left ({\mmb{\eta } }_{P} \right)-\varphi ^{\ast } \left ({\mmb{\eta } }_{Q} \right) -\nabla \varphi ^{\ast }\left ({\mmb{\eta } }_{Q} \right) \cdot \left ({\mmb{\eta } }_{P}-{\mmb{\eta } }_{Q} \right).\eqno{\hbox{(29)}}$$ However, we can prove that it satisfies $$D^{\ast }[P : Q] = D[Q : P]\eqno{\hbox{(30)}}$$ so that they are substantially the same, except for the order of points. The divergence is written concisely by using both ${\mmb{\theta } }$ and ${\mmb{\eta } }$ coordinates, $$D[P:Q] = \varphi \left ({\mmb{\theta } }_{P} \right) +\varphi ^{\ast } \left ({\mmb{\eta } }_{Q} \right)-{\mmb{\theta } }_{P}\cdot {\mmb{\eta } }_{Q},\eqno{\hbox{(31)}}$$ where ${\mmb{\theta } }_{P}$ and ${\mmb{\theta } }_{Q}$ are the ${\mmb{\theta } }$ coordinates of $P$ and $Q$, and ${\mmb{\eta } }_{P}$ and ${\mmb{\eta } }_{Q}$ are the ${\mmb{\eta } }$ coordinates of $P$ and $Q$, respectively.
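The duality relations (28), (30) and (31) can be verified on a small example. The sketch below assumes the hypothetical convex function $\varphi ({\mmb{\theta } }) = \sum _{i} \exp (\theta _{i})$, for which $\eta _{i} = \exp (\theta _{i})$ and the maximization in (27) has the closed form $\varphi ^{\ast }({\mmb{\eta } }) = \sum _{i} (\eta _{i} \log \eta _{i} - \eta _{i})$; this choice is for illustration and is not taken from the paper.

```python
import numpy as np

# Hypothetical example: phi(theta) = sum_i exp(theta_i).
phi = lambda t: np.exp(t).sum()
eta_of = lambda t: np.exp(t)                        # Legendre transform, Eq. (25)
phi_star = lambda e: (e * np.log(e) - e).sum()      # closed form of Eq. (27)
theta_of = lambda e: np.log(e)                      # gradient of phi_star, Eq. (28)


def D(theta_p, theta_q):
    """Bregman divergence in the theta coordinates, Eq. (23)."""
    return phi(theta_p) - phi(theta_q) - eta_of(theta_q) @ (theta_p - theta_q)


def D_star(eta_p, eta_q):
    """Dual Bregman divergence in the eta coordinates, Eq. (29)."""
    return phi_star(eta_p) - phi_star(eta_q) - theta_of(eta_q) @ (eta_p - eta_q)


tP = np.array([0.2, -0.4])
tQ = np.array([1.0, 0.3])
eP, eQ = eta_of(tP), eta_of(tQ)

# Mixed-coordinate form of the divergence, Eq. (31).
mixed = phi(tP) + phi_star(eQ) - tP @ eQ
```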

Since the Jacobian of the coordinate transformation from ${\mmb{\theta } }$ to ${\mmb{\eta } }$ is given by (26), the Jacobian of the reverse transformation is $$g^{\ast }_{ij} = {{\partial \theta _{i}}\over {\partial \eta _{j}}},\eqno{\hbox{(32)}}$$ which is the inverse of $G=\left (g_{ij}\right)$. Hence, $G^{\ast } =G^{-1}= \left (g^{\ast }_{ij}\right)$ is the Riemannian metric tensor expressed in the coordinate system ${\mmb{\eta } }$.

Let us consider a curve ${\mmb{\theta } }(t)$ parameterized by $t$. Its tangent vector at $t$ is given by $${\mathdot {\mmb{\theta }}}(t) = \sum\limits _{i} {\mathdot {\theta}}_{i} (t){\mbi{e} }_{i},\eqno{\hbox{(33)}}$$ where ${\mathdot {\theta}}_{i} = (d/dt) \theta _{i}(t)$ and ${\mbi{e} }_{i}$ is the tangent vector along the coordinate axis $\theta _{i}$. Since ${\mmb{\theta } }$ is an affine coordinate system, a geodesic (${\mmb{\theta } }$-geodesic) is a curve written as $${\mmb{\theta } }(t) = t{\mbi{a} }+ {\mbi{b} }\eqno{\hbox{(34)}}$$ in the ${\mmb{\theta } }$ coordinate system, where ${\mbi{a} }$ and ${\mbi{b} }$ are constant vectors.

Dually, we can express a curve in the ${\mmb{\eta } }$ coordinate system as ${\mmb{\eta } }(t)$. By using the tangent vector ${\mbi{e} }^{\ast }_{i}$ of the coordinate axis $\eta _{i}$, the tangent vector of ${\mmb{\eta } }(t)$ is written as $${\mathdot {\mmb{\eta }}}(t) = \sum\limits _{i} {\mathdot {\eta}}_{i} (t){\mbi{e} }^{\ast }_{i}.\eqno{\hbox{(35)}}$$ A dual geodesic, which we call an ${\mmb{\eta } }$-geodesic, is written as $${\mmb{\eta } }(t) = t{\mbi{a} }^{\ast }+ {\mbi{b} }^{\ast}\eqno{\hbox{(36)}}$$ in the ${\mmb{\eta } }$ coordinate system. This is not a ${\mmb{\theta } }$-geodesic in general.

The Riemannian metric $G$ is given by the inner products of basis vectors, $$g_{ij}({\mmb{\theta } }) = \langle {\mbi{e} }_{i}, {\mbi{e} }_{j} \rangle.\eqno{\hbox{(37)}}$$ Similarly, $G^{\ast }$ is given by $$g^{\ast }_{ij} = \langle {\mbi{e} }^{\ast }_{i}, {\mbi{e} }^{\ast}_{j} \rangle.\eqno{\hbox{(38)}}$$ From (37), (38) and $G^{\ast }=G^{-1}$, we see that the two bases of tangent vectors are related by $${\mbi{e} }_{i} = \sum\limits _{j} g_{ij} {\mbi{e} }^{\ast }_{j}, \quad{\mbi{e} }^{\ast }_{j} = \sum\limits _{i} g^{\ast }_{ji}{\mbi{e} }_{i}.\eqno{\hbox{(39)}}$$ We have an important result from this. See Fig. 2.

The two affine coordinate systems ${\mmb{\theta } }$ and ${\mmb{\eta } }$ are reciprocal, that is, their tangent vectors ${\mbi{e} }_{i}$ and ${\mbi{e} }^{\ast }_{j}$ are orthogonal at any point when $i \ne j$, $$\langle {\mbi{e} }_{i}, {\mbi{e} }^{\ast }_{j} \rangle = \delta _{ij},\eqno{\hbox{(40)}}$$ where $$\delta _{ij} = \cases{1, \hfill & $i=j$, \hfill \cr 0, \hfill & $i \ne j$.\hfill }\eqno{\hbox{(41)}}$$

Now we state a fundamental theorem of a dually flat manifold.

For three points $P$, $Q$ and $R$, when the ${\mmb{\theta } }$-geodesic connecting $P$ and $Q$ is orthogonal to the ${\mmb{\eta } }$-geodesic connecting $Q$ and $R$, $$D[P:Q] + D[Q:R] = D[P:R].\eqno{\hbox{(42)}}$$ Dually, when the ${\mmb{\eta } }$-geodesic connecting $P$ and $Q$ is orthogonal to the ${\mmb{\theta } }$-geodesic connecting $Q$ and $R$, $$D^{\ast } [P:Q]+D^{\ast }[Q:R] = D^{\ast }[P:R].\eqno{\hbox{(43)}}$$
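As a short check of (42), which the text does not spell out, one can substitute the mixed form (31) into each of the three divergences. The $\varphi$ and $\varphi ^{\ast }$ terms at $P$ and $R$ cancel, and the terms at $Q$ combine through $\varphi ({\mmb{\theta } }_{Q}) + \varphi ^{\ast }({\mmb{\eta } }_{Q}) = {\mmb{\theta } }_{Q} \cdot {\mmb{\eta } }_{Q}$, which follows from (27). What remains is $$D[P:Q] + D[Q:R] - D[P:R] = \left ({\mmb{\theta } }_{P} - {\mmb{\theta } }_{Q}\right) \cdot \left ({\mmb{\eta } }_{R} - {\mmb{\eta } }_{Q}\right).$$ By the reciprocity (40), the right-hand side is the inner product of the tangent vector of the ${\mmb{\theta } }$-geodesic from $Q$ to $P$ with the tangent vector of the ${\mmb{\eta } }$-geodesic from $Q$ to $R$, so it vanishes exactly under the stated orthogonality condition. The dual relation (43) follows in the same way, using (30) to exchange the roles of the two coordinate systems.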

See Fig. 3. We finally define a geodesic projection in a dually flat manifold. Let $S$ be a smooth submanifold in a dually flat manifold $M$ and let $P$ be a point outside $S$. Let us consider a point $Q \in S$. When the ${\mmb{\eta } }$-geodesic connecting $P$ and $Q$ is orthogonal to $S$, the point $Q$ is called the ${\mmb{\eta } }$-projection of $P$ to $S$. (We can define the ${\mmb{\theta } }$-projection similarly.) We now have the ${\mmb{\eta } }$-projection theorem from the Pythagorean theorem.

For a smooth submanifold $S$ in a dually flat manifold $M$ and a point $P$ outside $S$, the point in $S$ that minimizes divergence $D^{\ast }[P:Q]$, $Q \in S$, is the ${\mmb{\eta } }$-projection of $P$ to $S$. Moreover, when $S$ is convex in ${\mmb{\theta } }$ coordinates, the ${\mmb{\eta } }$-projection exists and is unique.

In the special case when $\varphi ({\mmb{\theta } })$ is a quadratic function, the Riemannian metric $G$ does not depend on ${\mmb{\theta } }$, so that it is a constant tensor. The manifold is Euclidean from the point of view of the Riemannian metric. The two affine coordinates are linearly related in this case, $${\mmb{\eta } } = G{\mmb{\theta } }+ {\mbi{c} }\eqno{\hbox{(44)}}$$ for a constant vector ${\mbi{c} }$. Hence, a ${\mmb{\theta } }$-geodesic is an ${\mmb{\eta } }$-geodesic at the same time. So the two types of geodesics are identical, and they are merely Euclidean geodesics. Each of the coordinate systems ${\mmb{\theta } }$ and ${\mmb{\eta } }$ is, in general, not orthogonal but oblique. They are, however, mutually orthogonal, that is, they are reciprocal systems. When $\varphi ({\mmb{\theta } })$ is not a quadratic function, $G$ depends on ${\mmb{\theta } }$ and the manifold is not Euclidean, although it has two mutually dual affine structures from the point of view of information geometry.

SECTION IV

LASSO obtains a solution that minimizes a quadratic function $\varphi ({\mmb{\theta } })$ under the $L_{1}$ constraint. We consider here an over-determined case since it is easier to explain the geometrical structure. However, our algorithm works in the under-determined case as well. Let ${\mmb{\theta } }^{opt}_{c}$ be the optimal solution of the Problem $P_{c}$. Since the constraint region specified by $c$, $$R_{c} = \left \{{\mmb{\theta } } \left \vert \; \sum\limits \left \vert \theta _{i}\right\vert\leq c \right.\right \},\eqno{\hbox{(45)}}$$ is a ${\mmb{\theta } }$-convex set, the optimal solution is unique and is given by the projection of ${\mmb{\theta } }^{opt}$ to the boundary $B_{c}$ of the polyhedron $R_{c}$. Note that $B_{c}$ is piecewise linear, including non-differentiable points such as vertices, edges, and higher-dimensional subfaces.

It is well known in a Euclidean space that the projection of ${\mmb{\theta } }^{opt}$ to a smooth convex region encircled by $S({\mmb{\theta } })=c$ is the point ${\mmb{\theta } }_{S}^{opt} \in S$ such that the vector ${\mmb{\theta } }^{opt}-{\mmb{\theta } }^{opt}_{S}$ connecting ${\mmb{\theta } }^{opt}$ and ${\mmb{\theta } }^{opt}_{S}$ is orthogonal to $S$, $${\mmb{\theta } }^{opt}-{\mmb{\theta } }^{opt}_{S} \propto G^{-1} \nabla S\left ({\mmb{\theta } }^{opt}_{S} \right).\eqno{\hbox{(46)}}$$ Here, $G$ is a constant tensor given by $g_{ij}= \langle {\mbi{e} }_{i},{\mbi{e} }_{j} \rangle$ and is not necessarily the identity matrix if ${\mmb{\theta } }$ is not orthonormal but oblique. Hence, $G^{-1}\nabla S({\mmb{\theta } })$ denotes the normal vector of $S$. In the present case, $B_{c}$ is not differentiable but is piecewise differentiable. When ${\mmb{\theta } }^{opt}_{B_{c}}$ belongs to a hypersurface of $B_{c}$, (46) is satisfied. But when it belongs to a subface (say an edge of $B_{c}$), the projection is defined by using the subgradient, instead of the gradient. We explain this.

The function $F({\mmb{\theta } })$ is not differentiable when some of the $\theta _{i}$ are 0. Non-differentiable positions sit in subfaces of the polyhedron, where some $\theta _{i}=0$. In order to show which subface ${\mmb{\theta } }$ belongs to, we define an active set of indices for each ${\mmb{\theta } }$, $$A({\mmb{\theta } }) = \left \{i \left \vert \; \theta _{i} \ne 0 \right.\right\},\quad A \subset N = \left \{1, 2,\ldots, p \right \}.\eqno{\hbox{(47)}}$$ Then, $F({\mmb{\theta } })$ is not differentiable at points whose active sets are not equal to $N$.

A subgradient $\nabla F({\mmb{\theta } })$ is used in convex analysis when $F$ is not differentiable (Bertsekas [17], Boyd and Vandenberghe [18]). It is a normal vector of one of the supporting hypersurfaces of $F$, and is written in the component form as $$\left (\nabla F \right)_{i} = \cases{{{\partial }\over {\partial \theta _{i}}}F({\mmb{\theta } }) = {\rm sgn} \; \theta _{i}, \hfill & $i \in A$, \hfill \cr \left (\nabla F \right)_{i} \in \left [-1, 1 \right ], \hfill & $i \in \bar {A} = N-A$,\hfill }\eqno{\hbox{(48)}}$$ in the present case of (1). Here, the $i$-th component $\left (\nabla F \right)_{i}$ of a subgradient $\nabla F$ is the ordinary gradient for $i \in A$ and is equal to the signature of $\theta _{i}$, but is any value in the interval $\left [-1, 1\right ]$ for $i \in \bar {A}$ where $\theta _{i}=0$. See Fig. 4. The set of all subgradients at each point ${\mmb{\theta } }$ is denoted by $\partial F ({\mmb{\theta } })$ and is called the subdifferential of $F$ at ${\mmb{\theta } }$ [17].
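The active set (47) and the subdifferential (48) of the $L_{1}$ norm translate directly into a membership test. The following is a minimal sketch; the helper names `active_set` and `in_subdifferential`, the tolerance, and the test point are hypothetical, and indices are 0-based here whereas the text uses $1,\ldots, p$.

```python
import numpy as np


def active_set(theta):
    """Indices i with theta_i != 0, Eq. (47) (0-based)."""
    return np.flatnonzero(np.asarray(theta) != 0)


def in_subdifferential(g, theta, tol=1e-12):
    """Check whether g is a subgradient of F(theta) = sum |theta_i|, Eq. (48)."""
    theta = np.asarray(theta, dtype=float)
    g = np.asarray(g, dtype=float)
    active = theta != 0
    # Active components must equal sgn(theta_i).
    ok_active = np.allclose(g[active], np.sign(theta[active]), atol=tol)
    # Inactive components may take any value in [-1, 1].
    ok_inactive = np.all(np.abs(g[~active]) <= 1.0 + tol)
    return bool(ok_active and ok_inactive)


theta = np.array([0.7, 0.0, -1.2])          # lies on a subface: theta_2 = 0
g_good = np.array([1.0, 0.3, -1.0])
g_bad = np.array([1.0, 1.5, -1.0])          # inactive component outside [-1, 1]
```

Any vector accepted by `in_subdifferential` satisfies the supporting-hyperplane inequality $F({\mbi{z} }) \geq F({\mmb{\theta } }) + {\mbi{g} } \cdot ({\mbi{z} } - {\mmb{\theta } })$ for all ${\mbi{z} }$, which is the defining property of a subgradient.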

In the present case where $\varphi$ is not necessarily quadratic, by differentiating the Lagrangian formulation (3) with respect to ${\mmb{\theta } }$, the optimal solution ${\mmb{\theta } }^{opt}_{\lambda }$ satisfies $$\nabla \varphi \left ({\mmb{\theta } }^{opt}_{\lambda }\right) =-\lambda \nabla F \left ({\mmb{\theta } }_{\lambda }^{opt}\right).\eqno{\hbox{(49)}}$$ When the active set $A\left ({\mmb{\theta } }^{opt}_{\lambda }\right)$ is not $N$, $\nabla F$ is understood as a subgradient. Since $${\mmb{\eta } }^{opt} = \nabla \varphi \left ({\mmb{\theta } }^{opt}\right)=0,\eqno{\hbox{(50)}}$$ (49) is written as $${\mmb{\eta } }^{opt}-{\mmb{\eta } }^{opt}_{\lambda } = \lambda \nabla F\left ({\mmb{\theta } }^{opt}_{\lambda }\right)\eqno{\hbox{(51)}}$$ in the ${\mmb{\eta } }$-coordinates. This shows that ${\mmb{\eta } }^{opt}_{\lambda }$ is the ${\mmb{\eta } }$-projection of ${\mmb{\eta } }^{opt}$ to $B_{c}$, where $\lambda = \lambda (c)$ is determined from $c$.

Indeed, we have from (10), (30) and (31), $$D^{\ast } \left [{\mmb{\theta } }^{opt} : {\mmb{\theta } }\right] =\varphi ({\mmb{\theta } })+\varphi ^{\ast } \left ({\mmb{\eta } }^{opt}\right),\eqno{\hbox{(52)}}$$ where $\varphi ^{\ast } \left ({\mmb{\eta } }^{opt}\right) = -\varphi \left ({\mmb{\theta } }^{opt}\right)$ is a constant that does not depend on ${\mmb{\theta } }$. Hence, minimizing $\varphi ({\mmb{\theta } })$ on $B_{c}$ is equivalent to minimizing $D^{\ast } \left [{\mmb{\theta } }^{opt} : {\mmb{\theta } }\right]$ on $B_{c}$. Information geometry shows that ${\mmb{\theta } }^{opt}_{\lambda }$ is the ${\mmb{\eta } }$-projection of ${\mmb{\theta } }^{opt}$ to $B_{c}$.

The ${\mmb{\eta } }$-geodesic connecting ${\mmb{\eta } }_{c}$ and ${\mmb{\eta } }^{opt}$ is $${\mmb{\eta } }(t) = (1-t) {\mmb{\eta } }^{opt} + t {\mmb{\eta } }_{c}= t{\mmb{\eta } }_{c},\eqno{\hbox{(53)}}$$ and its tangent direction is $${\mathdot {\mmb{\eta }}}(t) = {\mmb{\eta } }_{c}.\eqno{\hbox{(54)}}$$ Hence, from the projection theorem, we have the optimality condition. See Fig. 5.

The ${\mmb{\eta } }$-projection of ${\mmb{\theta } }^{opt}$ to $B_{c}$ is ${\mmb{\eta } }^{opt}_{c}$ in the ${\mmb{\eta } }$-coordinates that satisfies $${\mmb{\eta } }^{opt}_{c} = -\lambda \nabla F \left ({\mmb{\theta } }^{opt}_{c} \right)\eqno{\hbox{(55)}}$$ for one of the subgradients.

We denote by $\Pi _{c}$ the ${\mmb{\eta } }$-geodesic projection operator to $B_{c}$, so that $$\Pi _{c} {\mmb{\theta } }^{opt} = {\mmb{\theta } }^{opt}_{c}.\eqno{\hbox{(56)}}$$

SECTION V

The optimal solution ${\mmb{\theta } }^{opt}_{c}$ on $B_{c}$, that is, the ${\mmb{\eta } }$-geodesic projection of ${\mmb{\theta } }^{opt}$ to $B_{c}$, is unique since $R_{c}$ is ${\mmb{\theta } }$-convex. Consider the inverse projection. Given a point ${\mmb{\theta } }$ on $B_{c}$, we define the set of points ${\mmb{\theta } }^{\prime}$ outside $B_{c}$ such that the ${\mmb{\eta } }$-geodesic projection of ${\mmb{\theta } }^{\prime}$ to $B_{c}$ is equal to ${\mmb{\theta } }$, $$\Pi ^{-1}_{c} {\mmb{\theta } } = \left \{{\mmb{\theta } }^{\prime} \left \vert\; \Pi _{c} {\mmb{\theta } }^{\prime} = {\mmb{\theta } }\right.\right\}.\eqno{\hbox{(57)}}$$ This is called the inverse ${\mmb{\eta } }$-projection of ${\mmb{\theta } }$. The inverse ${\mmb{\eta } }$-projection of ${\mmb{\theta } } \in B_{c}$ is written simply by using the dual coordinates, $$\Pi ^{-1}_{c} {\mmb{\eta } } = \left \{{\mmb{\eta } }^{\prime} \left \vert \;{\mmb{\eta } }^{\prime}-{\mmb{\eta } } = t \nabla F({\mmb{\theta } })\right.\right \}.\eqno{\hbox{(58)}}$$ When ${\mmb{\theta } }$ belongs to a face of $B_{c}$, that is, when $A({\mmb{\theta } })=N$, $\nabla F({\mmb{\theta } })$ is unique so that $\Pi ^{-1}_{c} {\mmb{\theta } }$ is the ${\mmb{\eta } }$-geodesic in the direction of $\nabla F$ passing through ${\mmb{\theta } }$. In the ${\mmb{\eta } }$-coordinates, we have $${\mmb{\eta } }^{\prime} = {\mmb{\eta } }+ t \nabla F({\mmb{\theta } }).\eqno{\hbox{(59)}}$$ When ${\mmb{\theta } }$ belongs to a subface whose active set $A({\mmb{\theta } })$ is not $N$, $\nabla F({\mmb{\theta } })$ is not unique. In this case, the inverse projection is written $$\Pi ^{-1}_{c} {\mmb{\eta } } =\left \{{\mmb{\eta } }^{\prime} = {\mmb{\eta } }+ t{\mbi{n} } \left \vert{\mbi{n} } \in \partial F({\mmb{\theta } }), \; t \geq 0\right. \right \}.\eqno{\hbox{(60)}}$$ See Fig. 6.

For $c^{\prime}< c$, let us consider how $\Pi ^{-1}_{c}$ and $\Pi ^{-1}_{c^{\prime}}$ are related. When ${\mmb{\theta } }_{c^{\prime}}$ is in a face of $B_{c^{\prime}}$, i.e., $A\left ({\mmb{\theta } }_{c^{\prime}}\right) = N$, if the ${\mmb{\eta } }$-geodesic orthogonal to $B_{c^{\prime}}$ passes through ${\mmb{\theta } }_{c} \in B_{c}$, the geodesic $\Pi ^{-1}_{c} {\mmb{\theta } }_{c}$ is a part of $\Pi ^{-1}_{c^{\prime}} {\mmb{\theta } }_{c^{\prime}}$ (Fig. 7). Hence, $$\Pi ^{-1}_{c} {\mmb{\theta } }_{c} \subset\Pi ^{-1}_{c^{\prime}}{\mmb{\theta } }_{c^{\prime}}.\eqno{\hbox{(61)}}$$ When ${\mmb{\theta } }_{c^{\prime}}$ lies on a subface specified by $A \;(\ne N)$, its inverse image forms an ${\mmb{\eta } }$-flat cone defined in the ${\mmb{\eta } }$-coordinates by $$\Pi ^{-1}_{c^{\prime}} {\mmb{\theta } }_{c^{\prime}} = \left \{{\mmb{\eta } } \left \vert \; {\mmb{\eta } }-{\mmb{\eta} }_{c^{\prime}} = t \nabla F\left ({\mmb{\theta } }_{c^{\prime}} \right), \; t \geq 0 \right.\right \}.\eqno{\hbox{(62)}}$$ Let ${\mmb{\theta } }_{c}$ be a point in $B_{c}$ included in $\Pi ^{-1}_{c^{\prime}} {\mmb{\theta } }_{c^{\prime}}$. When $A\left ({\mmb{\theta } }_{c} \right) = A\left ({\mmb{\theta } }_{c^{\prime}}\right)$, the subdifferentials are identical, $\partial F \left ({\mmb{\theta } }_{c}\right) = \partial F\left ({\mmb{\theta } }_{c^{\prime}}\right)$. Then, $\Pi ^{-1}_{c^{\prime}} {\mmb{\theta } }_{c^{\prime}}$ is an ${\mmb{\eta } }$-parallel transport of $\Pi ^{-1}_{c} {\mmb{\theta } }_{c}$ (Fig. 7). Hence, we have the inclusion theorem.

Let ${\mmb{\theta } }_{c}$ and ${\mmb{\theta } }_{c^{\prime}} \;\left (c^{\prime} < c \right)$ be two points satisfying ${\mmb{\theta } }_{c}\in \Pi ^{-1}_{c^{\prime}}{\mmb{\theta } }_{c^{\prime}}$. Then $$\Pi ^{-1}_{c} {\mmb{\theta } }_{c} \subset\Pi ^{-1}_{c^{\prime}} {\mmb{\theta } }_{c^{\prime}},\quad c^{\prime}< c.\eqno{\hbox{(63)}}$$

The theorem implies that, as $c$ decreases, the region of cone $\Pi ^{-1}_{c} {\mmb{\theta } }_{c}$ increases monotonically. When $A\left ({\mmb{\theta } }_{c^{\prime}} \right)$ is not full, the inverse image $\Pi ^{-1}_{c^{\prime}}{\mmb{\theta } }_{c^{\prime}}$ includes not only $\Pi ^{-1}_{c}{\mmb{\theta } }_{c}$ but also some of $\Pi ^{-1}_{c}{\mathtilde {\mmb{\theta }}}_{c}$, where the active set $A\left ({\mathtilde {\mmb{\theta }}}_{c} \right)$ is properly larger than $A\left ({\mmb{\theta } }_{c} \right)$. This is a property of LARS [10] and the extended LARS [12].

Given ${\mmb{\theta } }^{opt}$, the active set $A\left ({\mmb{\theta } }_{c}^{opt}\right)$ monotonically decreases as $c$ decreases, making ${\mmb{\theta } }^{opt}_{c}$ sparser and sparser.

SECTION VI

We analyze the ${\mmb{\eta } }$-projection ${\mmb{\eta } }^{opt}_{c}$ as a function of $c$. It forms a solution path, connecting ${\mmb{\theta } }^{opt}_{c} = 0$ for $c=0$ and ${\mmb{\theta } }^{opt}_{c} ={\mmb{\theta } }^{opt}$ for sufficiently large $c$. When ${\mmb{\eta } }^{opt}_{c}$ belongs to a face of $B_{c}$, that is, the active set $A$ is $N$, ${\theta }^{opt}_{c, i} \ne 0$ for all $i$. Hence, we have from (55) $$\eta ^{opt}_{\lambda, i} = -\lambda s_{i},\eqno{\hbox{(64)}}$$ where $$s_{i} = {{\partial }\over {\partial \theta _{i}}} F ({\mmb{\theta } }) = {\rm sgn}\left (\theta _{i} \right).\eqno{\hbox{(65)}}$$

When the projection falls in a lower-dimensional subface whose active set is $A$, ${\mmb{\eta } }^{opt}_{\lambda}$ is equal to $-\lambda \nabla F \left ({\mmb{\theta } }^{opt}_{\lambda } \right)$ for a subgradient $\nabla F$. We first analyze a solution path ${\mmb{\theta } }^{opt}_{\lambda }$ or ${\mmb{\eta } }^{opt}_{\lambda }$ in the interval where the active set $A$ does not change. We partition and rearrange the components of ${\mmb{\theta } }$ and ${\mmb{\eta } }$ according to whether their indices belong to $A$ or $\bar {A}$, as $${\mmb{\theta } } = \left ({\mmb{\theta } }^{A}, {\mmb{\theta} }^{\bar {A}}\right),\quad{\mmb{\eta } } = \left ({\mmb{\eta } }^{A}, {\mmb{\eta} }^{\bar {A}}\right),\eqno{\hbox{(66)}}$$ where ${\mmb{\theta } }^{A}$ is the subvector of ${\mmb{\theta } }$ consisting of the nonzero components, whose indices belong to $A$. The other notations are similarly understood. We further define a signature vector ${\mbi{s} }^{A} = \left (s^{A}_{i} \right)$, $i \in A$, by $$s^{A}_{i} = {\rm sgn}\; \theta _{i}, \quad i \in A.\eqno{\hbox{(67)}}$$ Then, we have the equations to determine ${\mmb{\theta } }^{opt}_{\lambda }$ and ${\mmb{\eta } }^{opt}_{\lambda }$ in terms of the partitioned forms of $A$ and $\bar {A}$ components, $$\eqalignno{{\mmb{\eta } }^{opt, A}_{\lambda } =&\, -\lambda {\mbi{s} }^{A}, & {\hbox{(68)}}\cr{\mmb{\theta } }^{opt, \bar {A}}_{\lambda } =&\, 0, & {\hbox{(69)}}\cr\sum\limits \left \vert \theta ^{A}_{i} \right \vert =&\, c.& {\hbox{(70)}}}$$ The other components, $\mmb{\theta } ^{opt, A}_{\lambda }$ and ${\mmb{\eta } }^{opt, \bar {A}}_{\lambda }$, are determined from the ${\mmb{\theta } }$-${\mmb{\eta } }$ correspondence (25), (28). Since ${\mmb{\eta } }^{opt}_{\lambda }$ is $-\lambda$ times a subgradient $\nabla F$, (48) gives $$\left \vert {\eta }^{opt, \bar {A}}_{\lambda, i} \right \vert \leq \lambda =\left \vert \eta ^{opt, A}_{\lambda, i} \right \vert.\eqno{\hbox{(71)}}$$
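Conditions (68), (69) and (71) can be checked numerically on a LASSO solution in the quadratic case (6). The sketch below is an illustration under stated assumptions: it uses plain cyclic coordinate descent with soft thresholding, a standard solver swapped in here and not the Minkovskian gradient method of this paper; the data, the value `lam = 5.0`, and the helper names `soft` and `lasso_cd` are hypothetical.

```python
import numpy as np


def soft(z, lam):
    """Soft-thresholding operator used by coordinate descent."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_cd(X, y, lam, sweeps=500):
    """Minimize 0.5*|y - X theta|^2 + lam*|theta|_1 by cyclic coordinate descent.

    A generic solver chosen for illustration; not the method of the paper.
    """
    p = X.shape[1]
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            r_j = y - X @ theta + X[:, j] * theta[j]   # residual without coordinate j
            theta[j] = soft(X[:, j] @ r_j, lam) / col_sq[j]
    return theta


rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))                           # hypothetical design
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=40)
lam = 5.0

theta = lasso_cd(X, y, lam)
eta = X.T @ (X @ theta - y)        # eta = grad phi(theta), Eq. (25), quadratic case
A = theta != 0                     # boolean mask of the active set
```

At convergence, `eta[A]` should equal `-lam * np.sign(theta[A])` as in (68), and the inactive components of `eta` should have magnitude at most `lam` as in (71).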

The solution path ${\mmb{\theta } }^{opt}_{\lambda }$ (${\mmb{\eta } }^{opt}_{\lambda }$ in the $\eta$-coordinates) is continuous and is piecewise differentiable. It changes the direction discontinuously when its active set alters. It is differentiable while $A$ does not change. We consider the path ${\mmb{\theta } }^{opt}_{\lambda }$ inside a subface specified by $A$. By differentiating (68) and (69) with respect to $\lambda$, we have the following lemma.

The solution path inside a fixed subface specified by $A$ satisfies $$\eqalignno{{\mathdot {\mmb{\theta }}}^{opt, A}_{\lambda } =&\, -G^{-1}_{AA} {\mbi{s} }^{A}, & {\hbox{(72)}}\cr{\mathdot {\mmb{\theta }}}^{opt, \bar {A}}_{\lambda } =&\, 0,& {\hbox{(73)}}}$$ where ${\mathdot {\mmb{\theta }}}_{\lambda }= (d/d\lambda){\mmb{\theta } }_{\lambda }$ and $G_{AA}$ is the submatrix of $G$ corresponding to indices in $A$.

Proof: Due to (26), small changes of ${\mmb{\theta } }$ and ${\mmb{\eta } }$ are related by $$d{\mmb{\eta } } = G d {\mmb{\theta } }.\eqno{\hbox{(74)}}$$ Since $d{\mmb{\theta } }^{opt, \bar {A}}_{\lambda }=0$ on the path, the $A$-part of $d{\mmb{\eta } }^{opt}_{\lambda }$ is $$d{\mmb{\eta } }^{opt, A}_{\lambda } = G_{AA}d{\mmb{\theta } }^{opt, A}_{\lambda},\eqno{\hbox{(75)}}$$ or $$d{\mmb{\theta } }^{opt, A}_{\lambda } = \left (G_{AA}\right)^{-1}d{\mmb{\eta } }^{opt, A}_{\lambda }.\eqno{\hbox{(76)}}$$ Hence, by differentiating (68), we have (72), and by differentiating (69), we have (73). $\hfill \square$
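The direction (72), (73) can be checked numerically. The following is a minimal sketch in plain Python, not part of the original derivation: the helper names `solve` and `path_direction`, the example matrix `G`, and the sign pattern are illustrative assumptions. It solves $G_{AA}{\mathdot {\mmb{\theta }}}^{A} = -{\mbi{s} }^{A}$ on the active set and leaves the inactive components at zero.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in M[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def path_direction(G, signs, active):
    """Return d(theta)/d(lambda) for a fixed active set, following (72) and (73).

    G      : full metric (Hessian of phi) as a list of rows
    signs  : full-length list; signs[i] = sgn(theta_i) for i in the active set
    active : list of active indices (hypothetical input format)
    """
    G_AA = [[G[i][j] for j in active] for i in active]
    rhs = [-signs[i] for i in active]
    d_A = solve(G_AA, rhs)
    theta_dot = [0.0] * len(G)      # inactive components stay at zero, (73)
    for k, i in enumerate(active):
        theta_dot[i] = d_A[k]       # active components, (72)
    return theta_dot


# Illustrative 3-parameter example with active set {0, 1}.
G = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
theta_dot = path_direction(G, signs=[1, -1, 0], active=[0, 1])
eta_dot = [sum(G[i][j] * theta_dot[j] for j in range(3)) for i in range(3)]
```

In this example the active part of `eta_dot` equals $-{\mbi{s} }^{A}$, which is what differentiating (68) with respect to $\lambda$ requires.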

We study the angle between ${\mathdot {\mmb{\theta }}}^{opt}_{\lambda }$ and the coordinate axis $\theta _{i}$, whose tangent vector is ${\mbi{e} }_{i}$. The cosine of the angle is given by the inner product $\langle{\mathdot {\mmb{\theta }}}^{opt}_{\lambda }, {\mbi{e} }_{i} \rangle$ divided by $\Vert{\mathdot {\mmb{\theta }}}^{opt}_{\lambda }\Vert$. We observe the following remarkable feature of the solution path inside a subface, which is characteristic of LARS [10] and the extended LARS [12].

The solution path ${\mmb{\theta } }^{opt}_{\lambda }$ has the least equiangular property: ${\mathdot {\mmb{\theta }}}^{opt}_{\lambda }$ makes the same angle with every $\theta _{i}$-coordinate axis belonging to $A$, and its angles with the other $\theta _{i}$-coordinate axes $\left (i \in \bar {A} \right)$ are larger.

Proof: The inner product of ${\mathdot {\mmb{\theta }}}^{opt}_{\lambda }$ and ${\mbi{e} }_{i}$ is given by $$\langle {\mathdot {\mmb{\theta }}}^{opt}_{\lambda }, {\mbi{e} }_{i} \rangle=\left (G {\mathdot {\mmb{\theta }}}^{opt}_{\lambda }\right)^{\prime} {\mbi{e} }_{i}= {\mathdot {\mmb{\eta }}}^{opt \prime }_{\lambda }{\mbi{e} }_{i}.\eqno{\hbox{(77)}}$$ From $$\eqalignno{{\mathdot {\eta}}^{opt}_{\lambda, i} =&\, -\lambda s^{A}_{i},\quad {\rm for} \quad i \in A, & {\hbox{(78)}}\cr{\mathdot {\eta}}^{opt}_{\lambda, i} \in&\, -\lambda [{-1, 1}], \quad{\rm for}\quad i \in \bar {A},& {\hbox{(79)}}}$$ we have $$\left \vert \langle {\mathdot {\mmb{\theta }}}^{opt}_{\lambda }, {\mbi{e} }_{i} \rangle \right \vert =\lambda \geq \left \vert \langle {\mathdot {\mmb{\theta }}}^{opt}_{\lambda },{\mbi{e} }_{j} \rangle \right \vert ,\quad i \in A, j \in \bar {A}.\eqno{\hbox{(80)}}$$ $\hfill \square$

The above theorem characterizes the solution path ${\mmb{\theta } }^{opt}_{\lambda }$. See Fig. 8. The direction changes discontinuously when the active set $A$ is altered. It occurs when some $\theta _{i} = 0$ becomes $\theta _{i} \ne 0$ or $\theta _{i} \ne 0$ becomes $\theta _{i} = 0$.

SECTION VII

The solution path ${\mmb{\theta } }^{opt}_{\lambda }$ can be understood as gradient descent of the function $\varphi ({\mmb{\theta } })$ or $D^{\ast } \left [{\mmb{\theta } }^{opt}: {\mmb{\theta } }\right]$. This characterizes LARS, which starts at ${\mmb{\theta } }=0$ and approaches ${\mmb{\theta } }^{opt}$ as $\lambda$ decreases. To this end, we introduce the new notion of a Minkovskian gradient.

We first define a generalized gradient in a manifold in which a Minkovskian norm is given. (See, e.g., [19] for Minkovskian and Finsler spaces.) The ordinary gradient of $f({\mmb{\theta } })$ in a Euclidean space is $${\mbi{a} } = \nabla f({\mmb{\theta } }) = \left ({{\partial }\over {\partial \theta _{1}}}f,\ldots ,{{\partial }\over {\partial \theta _{p}}}f \right),\eqno{\hbox{(81)}}$$ representing the steepest direction of $f({\mmb{\theta } })$ provided the coordinate system ${\mmb{\theta } }$ is orthonormal. Otherwise, the steepest direction is given by the natural gradient $G^{-1} \nabla f$. We consider how $f(\mmb{\theta })$ changes when ${\mmb{\theta } }$ moves to ${\mmb{\theta } } + \varepsilon {\mbi{a} }$ in direction ${\mbi{a} }$ in a Minkovskian space, where the norm of ${\mbi{a} }$ is given by the $L_{q}$-norm, $$F_{q}({\mbi{a} }) = {{1}\over {q}} \sum\limits \left \vert a_{i} \right \vert ^{q}, \quad q>1.\eqno{\hbox{(82)}}$$ The steepest direction ${\mbi{a} }$ of $f({\mmb{\theta } })$ at ${\mmb{\theta } }$ is defined by $${\mbi{a} } = {\mathop {\lim }_{\varepsilon \rightarrow 0}} {\mathop {\arg \max }_{\mbi{a}}} \left\vert f({\mmb{\theta } }+ \varepsilon {\mbi{a} }) - f({\mmb{\theta } })\right \vert\eqno{\hbox{(83)}}$$ under the condition $$F_{q} ({\mbi{a} }) = 1.\eqno{\hbox{(84)}}$$ By the variational method, with a Lagrange multiplier $\lambda$, we have $${{\partial }\over {\partial {\mbi{a}}}} \left \{\nabla f({\mmb{\theta } }) \cdot {\mbi{a} } -\lambda F_{q} ({\mbi{a} })\right \} =0,\eqno{\hbox{(85)}}$$ for obtaining the steepest direction ${\mbi{a} }$, which gives $${{\partial }\over {\partial a_{i}}} F_{q} ({\mbi{a} }) \propto{{\partial }\over {\partial \theta _{i}}}f({\mmb{\theta } }).\eqno{\hbox{(86)}}$$ Hence, $$a_{i} = \left [ c \left \vert {{\partial }\over {\partial \theta _{i}}}f({\mmb{\theta } })\right \vert \right ]^{{1}/{(q-1)}} {\rm sgn}\left ({{\partial }\over {\partial \theta _{i}}} f ({\mmb{\theta } }) \right),\eqno{\hbox{(87)}}$$ where $c$ is a constant. In the Euclidean case of $q=2$, this gives the ordinary gradient, $${\mbi{a} } = \nabla f({\mmb{\theta } }).\eqno{\hbox{(88)}}$$

We define the Minkovskian gradient in the $L_{1}$-norm case by taking the limit $q \rightarrow 1$. As $q \rightarrow 1$, by putting $$c = {{1}\over {{\mathop {\max }_{i}} \left \vert {{\partial }\over {\partial \theta _{i}}} f\right \vert }},\eqno{\hbox{(89)}}$$ $a_{i}$ becomes 0 except for those $i$ for which the absolute values of $(\partial / \partial \theta _{i})f$ are maximal. Hence, we have $$a_{i} = \cases{{\rm sgn} \left ({{\partial }\over {\partial \theta _{i}}} f \right), & for $i$ such that $\left \vert {{\partial }\over {\partial \theta _{i}}}f \right \vert = {\mathop {\max }_{j}} \left \vert {{\partial}\over {\partial \theta _{j}}} f\right \vert$, \cr 0, & otherwise.}\eqno{\hbox{(90)}}$$ We call this the $L_{1}$-Minkovskian gradient, denoted by $\nabla ^{M}_{1}f({\mmb{\theta } })$. Note that $\left \vert a_{i}\right \vert =1$ for $i$ that gives the maximum of $\left \vert \partial / \partial \theta _{j} f \right \vert$ and $a_{i}=0$ for other $i$.
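The passage from (87) to (90) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function names `lq_steepest_direction` and `l1_minkovskian_gradient`, the tie tolerance `tol`, and the example gradient are hypothetical, and the direction is computed up to a positive scale factor, using the normalization (89) rather than the constraint (84).

```python
def sign(v):
    """Return -1.0, 0.0 or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def lq_steepest_direction(grad, q):
    """Direction (87) for q > 1, with c = 1 / max_i |grad_i| as in (89)."""
    m = max(abs(g) for g in grad)
    if m == 0.0:
        return [0.0] * len(grad)    # stationary point: no descent direction
    power = 1.0 / (q - 1.0)
    return [sign(g) * (abs(g) / m) ** power for g in grad]


def l1_minkovskian_gradient(grad, tol=1e-12):
    """Limit q -> 1, (90): signs on the maximal |grad_i|, zero elsewhere.

    `tol` is an assumed tolerance for deciding ties in floating point.
    """
    m = max(abs(g) for g in grad)
    return [sign(g) if abs(g) >= m - tol else 0.0 for g in grad]


grad = [3.0, -1.5, 3.0]                          # illustrative partial derivatives
euclid = lq_steepest_direction(grad, 2.0)        # proportional to grad, cf. (88)
near_l1 = lq_steepest_direction(grad, 1.01)      # non-maximal component almost vanishes
limit = l1_minkovskian_gradient(grad)            # [1.0, 0.0, 1.0]
```

For $q=2$ the result is proportional to the ordinary gradient, and as $q$ approaches 1 the non-maximal component shrinks towards zero, which is the behaviour used to define $\nabla ^{M}_{1}f$.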

We propose the Minkovskian steepest descent algorithm, starting at ${\mmb{\theta } }=0$, of tracing the minimum of $\varphi ({\mmb{\theta } })$ as $\lambda$ decreases or equivalently as $c$ increases. Let ${\mmb{\theta } }^{(t)}$ and ${\mmb{\eta } }^{(t)}$ be the current values in the respective coordinates. Then, the next ${\mmb{\theta } }^{(t+1)}={\mmb{\theta } }^{(t)}+ \Delta {\mmb{\theta } }^{(t)}$ is given by the $L_{1}$-Minkovskian gradient $$\Delta {\mmb{\theta } }^{(t)} = -\Delta t \nabla ^{M}_{1}\varphi \left ({\mmb{\theta } }^{(t)}\right),\eqno{\hbox{(91)}}$$ where $\Delta t$ is an adequate step-size. ${\mmb{\eta } }^{(t+1)}$ is obtained from ${\mmb{\theta } }^{(t+1)}$. The following is the algorithm, which is the same as LARS when $\varphi ({\mmb{\theta } })$ is a quadratic function. This is the reverse of the path pursuit of Hirose and Komaki [12], which starts at ${\mmb{\theta } }^{opt}$.

Begin with ${\mmb{\theta } }^{(0)}=0$, ${\mmb{\eta } }^{(0)}=\nabla \varphi \left ({\mmb{\theta } }^{(0)}\right)$.

For $t=0, 1, 2,\ldots ,$ calculate ${\mmb{\eta } }^{(t)}= \nabla \varphi \left ({\mmb{\theta } }^{(t)}\right)$ and determine the maximum of $\left \vert \eta ^{(t)}_{1}\right \vert ,\ldots, \left \vert \eta ^{(t)}_{p} \right \vert$. Let $${\mathop {\max }_{j}} \left \vert \eta ^{(t)}_{j} \right \vert =\left \vert \eta ^{(t)}_{i^{\ast }_{1}}\right \vert = \cdots = \left \vert \eta ^{(t)}_{i^{\ast }_{k}}\right \vert.\eqno{\hbox{(92)}}$$ Then, the active set is determined by $$A^{(t)} = \left \{i^{\ast }_{1},\ldots, i^{\ast }_{k} \right \}.\eqno{\hbox{(93)}}$$ By (90), the $L_{1}$-Minkovskian gradient is $$\nabla ^{M}_{1} \varphi \left ({\mmb{\theta } }^{(t)}\right)_{i} =\cases{{\rm sgn} \left (\eta ^{(t)}_{i} \right), \hfill & $i \in A^{(t)}$, \hfill \cr 0, \hfill & $i \in \bar {A}^{(t)}$.\hfill }\eqno{\hbox{(94)}}$$ On the solution path, (68) gives ${\rm sgn} \left (\eta ^{(t)}_{i} \right) = -s^{A}_{i}$ for $i \in A^{(t)}$.

Solve the equation of the solution path stepwise, changing the current ${\mmb{\theta } }^{(t)}$ to $${\mmb{\theta } }^{(t)} \rightarrow {\mmb{\theta } }^{(t+1)}={\mmb{\theta } }^{(t)} + \Delta {\mmb{\theta } }^{(t)},\eqno{\hbox{(95)}}$$ by $$\eqalignno{\Delta {\mmb{\theta } }^{(t)A} =&\, G_{AA}^{-1}({\mmb{\theta } }^{(t)}){\mbi{s} }^{A} \Delta t,& {\hbox{(96)}}\cr\Delta {\mmb{\theta } }^{(t) \bar {A}}=&\, 0. & {\hbox{(97)}}}$$ This can be rewritten in terms of the ${\mmb{\eta } }$-coordinates as $$\eqalignno{{\mmb{\eta } }^{(t+1)A} =&\, {\mmb{\eta } }^{(t)A} + {\mbi{s} }^{A} \Delta t, & {\hbox{(98)}}\cr{\mmb{\eta } }^{(t+1) \bar {A}} =&\, {\mmb{\eta } }^{(t)\bar {A}} +G_{\bar {A}A} \Delta {\mmb{\theta } }^{(t)A}.& {\hbox{(99)}}}$$

Check if $\left \vert \eta ^{(t+1)}_{i} \right \vert$, $i \in\bar {A}$, becomes equal to $\left \vert \eta ^{(t+1)}_{j} \right \vert$, $j \in A$. If this occurs for $j^{\ast } \in \bar {A}$, add it to the active set, forming a new active set $$A^{(t+1)} = \left \{i^{\ast }_{1},\ldots, i^{\ast }_{k}, j^{\ast } \right \}.\eqno{\hbox{(100)}}$$
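The steps above can be sketched in plain Python for the quadratic case $\varphi ({\mmb{\theta } }) = {1\over 2}{\mmb{\theta } }^{\prime }G{\mmb{\theta } } - {\mbi{b} }^{\prime }{\mmb{\theta } }$, where $G$ is constant. This is a rough illustration under stated assumptions, not the authors' implementation: the names `minkovskian_descent`, `dt`, `tie_tol` and `lam_stop` are hypothetical, a fixed small step is used instead of jumping to the next turning point, ties in (92) are detected up to the tolerance `tie_tol`, and the active set is recomputed at every step, which plays the role of the check (100).

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in M[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def sign(v):
    return (v > 0) - (v < 0)


def minkovskian_descent(G, b, lam_stop=0.0, dt=1e-3, tie_tol=5e-3,
                        max_iter=1000000):
    """Trace theta from 0 until max_i |eta_i| <= lam_stop + tie_tol.

    Quadratic case only: eta = grad phi(theta) = G theta - b.
    The result is accurate only up to the order of tie_tol.
    """
    p = len(b)
    theta = [0.0] * p                               # step 1: start at the origin
    for _ in range(max_iter):
        eta = [sum(G[i][j] * theta[j] for j in range(p)) - b[i]
               for i in range(p)]
        lam = max(abs(e) for e in eta)              # current level, cf. (92)
        if lam <= lam_stop + tie_tol:
            break
        # Step 2 (and the check of step 4): indices with near-maximal |eta_i|.
        active = [i for i in range(p) if abs(eta[i]) >= lam - tie_tol]
        # Signature on the active set: s_i = -sgn(eta_i), from (68).
        s_A = [-sign(eta[i]) for i in active]
        # Step 3: delta theta^A = G_AA^{-1} s^A dt, inactive part unchanged.
        G_AA = [[G[i][j] for j in active] for i in active]
        step = solve(G_AA, s_A)
        for k, i in enumerate(active):
            theta[i] += dt * step[k]
    return theta


# Orthonormal example: the path is soft-thresholding of b.
G_id = [[1.0, 0.0], [0.0, 1.0]]
sparse = minkovskian_descent(G_id, [3.0, 1.0], lam_stop=1.5)   # about (1.5, 0)
full = minkovskian_descent(G_id, [3.0, 1.0])                   # about (3, 1)
```

With an orthonormal metric the second coordinate stays exactly zero until the level $\lambda$ falls to 1, so stopping at `lam_stop=1.5` returns a sparse point of the path. A production version would replace the fixed step by the exact distance to the next change of $A$, as discussed for LARS.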

We now compare our Minkovskian gradient method with the original LARS [4], [10] and the extended LARS by Hirose and Komaki [12]. As can easily be seen, our algorithm coincides with LARS when $\varphi ({\mmb{\theta } })$ is quadratic, and it is applicable to a general convex $\varphi ({\mmb{\theta } })$. Therefore, it is an extended LARS. When $\varphi ({\mmb{\theta } })$ is a quadratic function, $M$ is Euclidean so that an ${\mmb{\eta } }$-geodesic is a ${\mmb{\theta } }$-geodesic at the same time. Hence, we do not need step-wise calculations of ${\mmb{\theta } }^{(t)}$, ${\mmb{\eta } }^{(t)}$ while $A$ is fixed. We may use a large $\Delta t$, chosen so that the next ${\mmb{\eta } }^{(t+1)}$ is a turning point of the solution path where $A$ changes. This is a merit of the LARS procedure when $\varphi$ is quadratic.

The extended LARS [12] begins with ${\mmb{\theta } }^{opt}$, and traces the solution path (96) and (97) in the opposite direction by changing the sign of (96). This direction is equi-angular and is hence the opposite of the Minkovskian gradient. When some $\theta _{i}$, say $\theta _{i^{\ast}}$, becomes 0, index $i^{\ast }$ is removed from the active set. The index $i^{\ast }$ is determined by using the Pythagorean theorem of the ${\mmb{\eta } }$-projection. The procedure continues until ${\mmb{\theta } }$ reaches the origin, ${\mmb{\theta } }=0$. Therefore, its solution path runs from ${\mmb{\theta } }^{opt}$ to 0, which is exactly the same as ours except that ours runs from 0 to ${\mmb{\theta } }^{opt}$. It should be remarked that the algorithm [12] cannot be applied to the under-determined case where no unique ${\mmb{\theta } }^{opt}$ exists. Our extended LARS works even in this case, as will be shown in the next section. We summarize these in the following theorem.

The extended LARS is a gradient descent method based on the $L_{1}$-Minkovskian gradient.

SECTION VIII

We have so far assumed that $\varphi ({\mmb{\theta } })$ is strictly convex. This implies that there exists a unique ${\mmb{\theta } }^{opt}$ satisfying $${\mmb{\eta } }^{opt} = \nabla \varphi \left ({\mmb{\theta } }^{opt}\right) = 0\eqno{\hbox{(101)}}$$ and $G= \nabla \nabla \varphi$ is a full-rank positive-definite matrix. However, the Minkovskian gradient method works even in the under-determined case. This is because the optimality condition (49) is the same, and the least equi-angle theorem holds in the same way. Note that $\varphi ({\mmb{\theta } })$ is not strictly convex in the under-determined case so that the optimal solution of $$\nabla \varphi ({\mmb{\theta } }) = 0\eqno{\hbox{(102)}}$$ is not unique but forms a $k$-dimensional submanifold. In the regression case, $k=p-n$. We need to assume that any $m \times m$ principal submatrix of $G$ is not degenerate when $m \leq k$, which holds generically.

We show that the Minkovskian gradient algorithm works in the under-determined case. We have ${\mmb{\eta } }= \nabla \varphi({\mmb{\theta } })$, but we cannot obtain ${\mmb{\theta } }$ from ${\mmb{\eta } }$ uniquely in the under-determined case. Differentiating the Lagrangean problem (3) shows that the optimal solution satisfies (49). We further differentiate it with respect to $\lambda$, obtaining $$\eqalignno{G_{AA} {\mathdot {\mmb{\theta }}}^{opt, A}_{\lambda } =&\, -{\mbi{s} }^{A}, & {\hbox{(103)}}\cr{\mathdot {\mmb{\theta }}}^{opt, \bar {A}}_{\lambda } =&\, 0,& {\hbox{(104)}}}$$ while the active set $A$ is fixed. When $\left \vert A\right \vert \leq k$, (96) and (97) or equivalently (98) and (99) hold, since $G_{AA}$ is of full rank. Therefore, the Minkovskian gradient method works well until the optimal solution is obtained, for which the number of non-zero components of ${\mmb{\theta } }^{(t)}$ is at most $k$.

The algorithm proceeds in terms of ${\mmb{\eta } }^{(t)}_{\lambda }$ and $A^{(t)}$, but it is possible to obtain ${\mmb{\theta } }^{(t)}$ from ${\mmb{\eta } }^{(t)}_{\lambda }$ and $A^{(t)}$ by solving $$\eqalignno{{\mmb{\eta } }^{(t)}_{\lambda } =&\, \nabla \varphi\left ({\mmb{\theta } }^{(t)}_{\lambda }\right), & {\hbox{(105)}}\cr{\mmb{\theta } }_{\lambda }^{(t)\bar {A}} =&\, 0.& {\hbox{(106)}}}$$ The equations are solvable when any $m \times m$ principal submatrix of $G~(m \leq k)$ is non-degenerate.

SECTION IX

We searched for the solution path ${\mmb{\theta } }^{opt}_{\lambda }$ as $\lambda$ changes, but did not consider a consistent estimator which might be obtained by choosing $\lambda$ adequately. In order to discuss an efficient consistent estimator, the oracle properties, that is, identifying the true active set $A$ without losing the optimal convergence rate as $n$ tends to infinity, were proposed in [20]. It is known that the LASSO does not satisfy the oracle properties, but the adaptive LASSO proposed in [21] satisfies them. The adaptive LASSO uses the weighted $L_{1}$ constraint $$F({\mmb{\theta } }, {\mbi{w} }) = \sum\limits w_{i} \left \vert \theta _{i} \right\vert ,\eqno{\hbox{(107)}}$$ and modifies ${\mbi{w} }$ adaptively in a data-dependent way, $${\mathhat {w}}_{i} = \left \vert {\mathhat {\theta}}_{i} \right \vert ^{-\gamma }, \qquad\gamma >0,\eqno{\hbox{(108)}}$$ where ${\mathhat {\mmb{\theta }}}$ is a consistent estimator.

The present paper does not search for consistent estimators but studies the solution path ${\mmb{\theta } }^{opt}_{\lambda }$, which includes various degrees of sparse solutions. However, it is interesting to study the geometry of the weighted $L_{1}$ constraint problem. When ${\mbi{w} }$ is fixed, let $W$ be the diagonal matrix with diagonal entries $w_{i}$. We rescale ${\mmb{\theta } }$ and ${\mbi{x} }_{a}$ by $$\eqalignno{{\mathtilde {\mmb{\theta }}} =&\, W{\mmb{\theta } }, & {\hbox{(109)}}\cr{\mathtilde {\mbi{x}}}_{a} =&\, {\mbi{x} }_{a} W^{-1}.& {\hbox{(110)}}}$$ Then, the constraint is $L_{1}$ in terms of ${\mathtilde {\mmb{\theta }}}$, $${\mathtilde {F}}\left ({\mathtilde {\mmb{\theta }}}\right) = \sum\limits \left\vert {\mathtilde {\theta}}_{i}\right \vert ,\eqno{\hbox{(111)}}$$ and the target function is $${\mathtilde {\varphi}} \left ({\mathtilde {\mmb{\theta }}}\right) = \varphi\left (W^{-1}{\mathtilde {\mmb{\theta }}}\right),\eqno{\hbox{(112)}}$$ in particular $${\mathtilde {\varphi}} \left ({\mathtilde {\mmb{\theta }}}\right) = {{1}\over{2}}\left \vert {\mbi{y} }- XW^{-1} {\mathtilde {\mmb{\theta }}}\right\vert ^{2}\eqno{\hbox{(113)}}$$ in the quadratic case.

The Riemannian metric changes to $${\mathtilde {G}} = {\mathtilde {\nabla}} {\mathtilde {\nabla}} {\mathtilde {\varphi}}\left ({\mathtilde {\mmb{\theta }}}\right)= W^{-1}G W^{-1},\eqno{\hbox{(114)}}$$ where ${\mathtilde {\nabla}}=\left (\partial / \partial {\mathtilde {\theta}}_{i}\right)$. Hence, the problem is formulated in the same way in ${\mathtilde {\mmb{\theta }}}$, but the geometry changes from $G$ to ${\mathtilde {G}}$. The adaptive LASSO implies an adaptive selection of the geometry.
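The change of metric (114) is easy to verify in the quadratic case, where $G = X^{\prime }X$. The following plain-Python sketch uses an illustrative design matrix and weight vector, and the helper names `gram`, `rescale_design` and `rescale_metric` are assumptions introduced here. It compares the Gram matrix of the rescaled design $XW^{-1}$ with $W^{-1}GW^{-1}$.

```python
def gram(X):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]


def rescale_design(X, w):
    """Return X W^{-1}: column i of X divided by w_i, cf. (110)."""
    return [[x / wi for x, wi in zip(row, w)] for row in X]


def rescale_metric(G, w):
    """Return W^{-1} G W^{-1} for diagonal W = diag(w), cf. (114)."""
    p = len(w)
    return [[G[i][j] / (w[i] * w[j]) for j in range(p)] for i in range(p)]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # illustrative 3 x 2 design
w = [2.0, 4.0]                               # illustrative fixed weights

G_tilde_direct = gram(rescale_design(X, w))  # Hessian in the rescaled coordinates
G_tilde_formula = rescale_metric(gram(X), w) # right-hand side of (114)
```

Both computations give the same matrix, here with entries 8.75, 5.5 and 3.5, so a larger weight $w_{i}$ shrinks the metric along the $i$th coordinate.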

Instead of the adaptive ${\mathhat {\mbi{w}}}$ in (108), we may consider the case where the weight vector is given by $$w_{i} = \left \vert \theta _{i} \right \vert ^{-\gamma }, \qquad 0< \gamma < 1.\eqno{\hbox{(115)}}$$ The constraint is then given by $$F({\mmb{\theta } }) = \sum\limits \left \vert \theta _{i} \right \vert ^{1-\gamma },\eqno{\hbox{(116)}}$$ so that the constraint is $L_{p}$, $p=1-\gamma$. The problem is non-convex optimization in this case. It is interesting to compare the adaptive LASSO, which is solved in the framework of convex optimization, with the nonconvex $L_{p}$ problem, $0< p< 1$. Information geometry is useful for studying the problem in this case, too.
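A small numerical check makes the two claims above concrete: the weighted $L_{1}$ constraint with weights (115) equals the $L_{p}$ constraint (116), and the latter is not convex. The sketch below is illustrative only. The function names and test points are assumptions, and components with $\theta _{i}=0$, whose weight is infinite, are taken to contribute zero.

```python
def weighted_l1(theta, gamma):
    """sum_i w_i |theta_i| with w_i = |theta_i|^(-gamma), as in (107), (115).

    Zero components are skipped (assumed convention: they contribute 0).
    """
    return sum(abs(t) ** (-gamma) * abs(t) for t in theta if t != 0.0)


def lp_constraint(theta, gamma):
    """F(theta) = sum_i |theta_i|^(1 - gamma), as in (116)."""
    return sum(abs(t) ** (1.0 - gamma) for t in theta)


gamma = 0.5                                   # so p = 1 - gamma = 1/2
theta = [4.0, 0.0, -0.25]
same = abs(weighted_l1(theta, gamma) - lp_constraint(theta, gamma)) < 1e-12

# Convexity would require F(midpoint) <= average of F at the end points.
a, b = [1.0, 0.0], [0.0, 1.0]
mid = [0.5 * (x + y) for x, y in zip(a, b)]
violates_convexity = lp_constraint(mid, gamma) > 0.5 * (
    lp_constraint(a, gamma) + lp_constraint(b, gamma))
```

At the midpoint of the two unit vectors the constraint function takes the value $\sqrt{2}$, which exceeds the average value 1 at the end points, so the convexity inequality fails for $p=1/2$.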

SECTION X

We have elucidated the information-geometrical properties of convex optimization problems under the $L_{1}$ norm constraint, following the ideas of Hirose and Komaki [12] in more detail. The two dually coupled affine coordinates play a fundamental role, although the parameter space is not Euclidean but Riemannian, except for the case of optimization of a quadratic function. We proposed the Minkovskian gradient method, which reverses the procedure proposed in [12]. This is an extension of the original LARS. It is numerically robust because it is a gradient descent method. The method can be applied even to the under-determined case, where the target function is not strictly convex. It would be interesting to generalize the current approach to the $L_{1/2}$-regularization problem; see Xu, Chang, Xu and Zhang [22], [23]; Yukawa and Amari [24].

The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Shiro Ikeda.

S.-I. Amari is with the RIKEN Brain Science Institute, Saitama 351-0198, Japan (e-mail: amari@brain.riken.jp).

M. Yukawa is with the Department of Electrical and Electronic Engineering, Niigata University, Niigata 950-2181, Japan (e-mail: yukawa@eng.niigata-u.ac.jp).
