
On the Convergence Proof of AMSGrad and a New Version


Abstract:

The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular optimizer in the training of deep neural networks. However, Reddi et al. have recently shown that the convergence proof of Adam is problematic, and they have also proposed a variant of Adam called AMSGrad as a fix. In this paper, we show that the convergence proof of AMSGrad is also problematic. Concretely, the problem in the convergence proof of AMSGrad lies in its handling of the hyper-parameters: they are treated as equal while they are not. This is also a neglected issue in the convergence proof of Adam. We provide an explicit counter-example in a simple convex optimization setting to illustrate this issue. Depending on how the hyper-parameters are handled, we present various fixes for this issue. We provide a new convergence proof for AMSGrad as the first fix. We also propose a new version of AMSGrad called AdamX as another fix. Our experiments on a benchmark dataset also support our theoretical results.
Published in: IEEE Access ( Volume: 7)
Page(s): 61706 - 61716
Date of Publication: 13 May 2019
Electronic ISSN: 2169-3536

SECTION I.

Introduction and Our Contributions

One of the most popular algorithms for training deep neural networks is stochastic gradient descent (SGD) [1] and its variants. Among the various variants of SGD, the algorithm with the adaptive moment estimation Adam [2] is widely used in practice. However, Reddi et al. [3] have recently shown that the convergence proof of Adam is problematic and proposed a variant of Adam called AMSGrad to solve this issue.

Our contribution. In this paper, we point out a flaw in the convergence proof of AMSGrad, recalled as Theorem A below. We then fix this flaw by providing a new convergence proof for AMSGrad in the case of special parameters. In addition, in the case of general parameters, we propose a new and slightly modified version of AMSGrad.

To provide more details, let us recall AMSGrad in Algorithm 1; the mathematical notation used there is fully defined in Section II.

Algorithm 1

AMSGrad (Reddi et al. [3])

Input:

x_{1}\in \mathcal F , step size \{\alpha _{t}\}_{t=1}^{T}, \{\beta _{1,t}\}_{t=1}^{T}, \beta _{2}

Set m_{0} = 0, v_{0} = 0 , and \hat v_{0} = 0

for (t=1; t\le T; t\gets t+1) do

g_{t} = \nabla f_{t}(x_{t})

m_{t} = \beta _{1,t}\cdot m_{t-1} + (1-\beta _{1,t})\cdot g_{t}

v_{t} = \beta _{2}\cdot v_{t-1} + (1-\beta _{2})\cdot g^{2}_{t}

\hat v_{t} = \max (\hat v_{t-1}, v_{t}) and \hat V_{t} = \text {diag}(\hat v_{t})

x_{t+1} = \prod _{\mathcal F, \sqrt {\hat V_{t}}}(x_{t} - \alpha _{t} \cdot m_{t}/\sqrt {\hat v_{t}})

end for
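For readers who prefer code, the following is a minimal NumPy sketch of Algorithm 1 on a box-constrained problem; it is our own illustration, not the authors' code. The helper names (amsgrad, grad_fns, lo, hi) and the small stabilizing constant eps are assumptions on our part, and for a box \mathcal F the weighted projection \prod _{\mathcal F, \sqrt {\hat V_{t}}} reduces to coordinate-wise clipping.

import numpy as np

def amsgrad(grad_fns, x1, alpha=0.001, beta1=0.9, beta2=0.999,
            lam=1.0, lo=-1.0, hi=1.0, eps=1e-8):
    """Minimal AMSGrad sketch (Algorithm 1) on the box [lo, hi]^d.

    grad_fns[t-1](x) returns the gradient of f_t at x, and
    beta_{1,t} = beta1 * lam**(t-1) (lam=1.0 gives a constant beta1).
    """
    x = np.asarray(x1, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    v_hat = np.zeros_like(x)
    iterates = [x.copy()]
    for t, grad in enumerate(grad_fns, start=1):
        g = grad(x)
        beta1_t = beta1 * lam ** (t - 1)
        m = beta1_t * m + (1.0 - beta1_t) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g ** 2       # second-moment estimate
        v_hat = np.maximum(v_hat, v)                 # the AMSGrad max step
        x = x - (alpha / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps)
        x = np.clip(x, lo, hi)                       # projection onto the box
        iterates.append(x.copy())
    return iterates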

The main theorem for the convergence of AMSGrad in [3] is as follows. To simplify the notation, we define g_{t} \mathop{=}\limits^{\Delta } \nabla f_{t}(x_{t}) , g_{t,i} as the i^{\text {th}} element of g_{t} and g_{1:t,i} \in \mathbb R^{t} as a vector that contains the i^{\text {th}} dimension of the gradients over all iterations up to t , namely, g_{1:t,i} = [g_{1,i}, g_{2,i},\ldots ,g_{t,i}] .

Theorem A [Theorem 4 in [3], problematic]:

Let x_{t} and v_{t} be the sequences obtained from Algorithm 1, \alpha _{t} = \frac {\alpha }{\sqrt {t}} , \beta _{1} = \beta _{1,1} , \beta _{1,t} \le \beta _{1} for all t\in [T] and \gamma = \frac {\beta _{1}}{\sqrt {\beta _{2}}} \le 1 . Assume that \mathcal F has bounded diameter D_{\infty } and \lVert {\nabla f_{t}(x)}\rVert _{\infty } \le G_{\infty } for all t\in [T] and x\in \mathcal F . For x_{t} generated using AMSGrad (Algorithm 1), we have the following bound on the regret:\begin{align*} R(T)\le&\frac {D_{\infty }^{2}\sqrt {T}}{\alpha (1-\beta _{1})}\sum \limits _{i=1}^{d} \sqrt {\hat {v}_{T,i}} \\&+\,\frac {D_{\infty }^{2}}{2(1-\beta _{1})} \sum \limits _{i=1}^{d} \sum _{t=1}^{T}\frac {\beta _{1,t}\sqrt {\hat v_{t,i}}}{\alpha _{t}} \\&+\,\frac {\alpha \sqrt { 1+\ln T}}{(1-\beta _{1})^{2}(1-\gamma)\sqrt {1-\beta _{2}}} \sum \limits _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}


In their proof for Theorem A, Reddi et al. resolved an issue with the so-called telescoping sum in the convergence proof of Adam ([2, Theorem 10.5]). Specifically, Reddi et al. adjusted \hat v_{t} such that \begin{equation*} {\frac {\sqrt {\hat v_{t+1,i}}}{\alpha _{t+1}} \ge \frac {\sqrt {\hat v_{t,i}}}{\alpha _{t}} }\tag{1}\end{equation*}

for all i \in [d] . However, there is another issue (shown in Section III) in the convergence proof of Adam that AMSGrad unfortunately neglects. The issue affects both the correctness of Reddi et al.’s proof and the upper bound for the regret in Theorem A. To deal with the issue in a general way, we propose to modify Algorithm 1 such that \begin{equation*} {\frac {\sqrt {\hat {v}_{t+1,i}}}{\alpha _{t+1}(1-\boxed {\beta _{1, t+1}})} \ge \frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}(1-\boxed {\beta _{1, t}})}}\end{equation*}
for all i \in [d] . The differences with (1) are highlighted in the boxes for clarity.

Paper roadmap. We begin with preliminaries in Section II. We show where the proof of Theorem A becomes invalid in Section III. After that, we suggest two ways to resolve the issue in Sections IV and V.

SECTION II.

Preliminaries

Notation. Given a sequence of vectors \{x_{t}\}_{1\le t\le T} (1\le T\in \mathbb N) in \mathbb R^{d} , we denote its i^{\text {th}} coordinate by x_{t,i} and use x_{t}^{k} to denote the elementwise power of k and \lVert {x_{t}}\lVert _{2} , resp. \lVert {x_{t}}\lVert _{\infty } , to denote its \ell _{2} -norm, resp. \ell _{\infty } -norm. Let \mathcal F \subseteq \mathbb R^{d} be a feasible set of points such that \mathcal F has bounded diameter D_{\infty } , that is, \lVert {x-y}\rVert _{\infty } \le D_{\infty } for all x,y\in \mathcal F , and \mathcal S^{d}_{+} denote the set of all positive definite d\times d matrices. For a matrix A\in \mathcal S^{d}_{+} , we denote A^{1/2} for the square root of A . The projection operation \prod _{\mathcal F, A} (y) for A \in \mathcal S^{d}_{+} is defined as \mathrm {argmin}_{ x\in \mathcal F}\lVert {A^{1/2} (x-y)}\lVert _{2} for all y \in \mathbb R^{d} . When d=1 and \mathcal F \subset \mathbb R , the positive definite matrix A is a positive number, so that the projection \prod _{\mathcal F, A} (y) becomes \mathrm {argmin}_{x\in \mathcal F}|x-y| . We use \langle x, y \rangle to denote the inner product between x and y\in \mathbb R^{d} . The gradient of a function f evaluated at x\in \mathbb R^{d} is denoted by \nabla f(x) . For vectors x, y\in \mathbb R^{d} , we use \sqrt {x} or x^{1/2} for element-wise square root, x^{2} for element-wise square, x/y to denote element-wise division. For an integer n\in \mathbb N , we denote by [n] the set of integers \{1,2,\ldots ,n\} .

Optimization setup. Let f_{1}, f_{2},\ldots , f_{T}: \mathbb R^{d} \to \mathbb R be an arbitrary sequence of convex cost functions and x_{1}\in \mathbb R^{d} . At each time t\ge 1 , the goal is to predict the parameter x_{t} and evaluate it on a previously unknown cost function f_{t} . Since the nature of the sequence is unknown in advance, the algorithm is evaluated by its regret, that is, the sum over all previous steps of the difference between the online prediction f_{t}(x_{t}) and the cost f_{t}(x^{*}) at the best fixed parameter x^{*} from a feasible set \mathcal F . Concretely, the regret is defined as \begin{equation*} R(T) = \sum _{t=1}^{T}[f_{t}(x_{t}) -f_{t}(x^{*})],\end{equation*}

where x^{*} = \mathrm {argmin}_{x\in \mathcal F}\sum _{t=1}^{T}f_{t}(x) .

Definition 2.1:

A function f: \mathbb R^{d} \rightarrow \mathbb R is convex if for all x, y\in \mathbb R^{d} , and all \lambda \in [{0,1}] , \begin{equation*} \lambda f(x) + (1-\lambda)f(y) \ge f(\lambda x + (1-\lambda)y).\end{equation*}


Lemma 2.2:

If a function f: \mathbb R^{d} \rightarrow \mathbb R is convex, then for all x, y\in \mathbb R^{d} , \begin{equation*} f(y) \ge f(x) + \nabla f(x)^{\sf T}(y-x),\end{equation*}

where \nabla f(x)^{\sf T} denotes the transpose of \nabla f(x) .

Lemma 2.3 (Cauchy–Schwarz inequality):

For all n\ge 1 , u_{i}, v_{i}\in \mathbb R (1\le i \le n) , \begin{equation*} \left ({\sum _{i=1}^{n} u_{i}v_{i}}\right)^{2} \le \left ({\sum _{i=1}^{n}u_{i}^{2}}\right) \left ({\sum _{i=1}^{n}v_{i}^{2}}\right).\end{equation*}


Lemma 2.4 (Taylor series):

For \alpha \in \mathbb R and 0 < \alpha < 1 , \begin{equation*} \sum _{t\ge 0}{\alpha ^{t}} = \frac {1}{1-\alpha }\end{equation*}

and \begin{equation*} \sum _{t\ge 1}{t\alpha ^{t-1}} = \frac {1}{(1-\alpha)^{2}}.\end{equation*}

Lemma 2.5 (Upper bound for the harmonic series):

For N\in \mathbb N , \begin{equation*} \sum _{n=1}^{N} \frac {1}{n}\le \ln N +1.\end{equation*}


Lemma 2.6:

For N\in \mathbb N , \begin{equation*} \sum _{n=1}^{N} \frac {1}{\sqrt {n}}\le 2\sqrt {N}.\end{equation*}


Lemma 2.7:

For all n\in \mathbb N and a_{i}, b_{i} \in \mathbb R such that a_{i}\ge 0 and b_{i}>0 for all i\in [n] , \begin{equation*} \frac {\sum _{i=1}^{n}a_{i}}{\sum _{j=1}^{n} b_{j}} \le \sum _{i=1}^{n}\frac {a_{i}}{b_{i}}.\end{equation*}


Lemma 2.8:

[4, Lemma 3 in arXiv version] For any Q \in \mathcal S^{d}_{+} and convex feasible set \mathcal F\subseteq \mathbb R^{d} , suppose u_{1} = \mathrm {argmin}_{x\in \mathcal F}\lVert Q ^{1/2}(x-z_{1})\rVert and u_{2} = \mathrm {argmin}_{x\in \mathcal F}\lVert Q ^{1/2}(x-z_{2})\rVert . Then, we have \begin{equation*} \lVert Q ^{1/2}(u_{1}-u_{2})\rVert \le \lVert Q ^{1/2}(z_{1}-z_{2})\rVert.\end{equation*}


SECTION III.

Issue in the Convergence Proof of AMSGrad

Before showing the issue in the convergence proof of AMSGrad, let us recall and prove the following inequality, which also appears in [3].

Lemma 3.1:

Algorithm 1 achieves the following guarantee, for all T\ge 1 :\begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}((x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2})}{ 2\alpha _{t}(1-\beta _{1,t})} \\&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2}.\end{align*}


Proof:

We note that \begin{align*} x_{t+1}=&\prod _{\mathcal F, \sqrt {\hat V_{t}}}(x_{t} - \alpha _{t} \cdot {\hat V^{-1/2}_{t}}m_{t}) \\=&\mathrm {argmin}_{x\in \mathcal F}\lVert \hat V_{t}^{1/4}(x-(x_{t} - \alpha _{t} \hat V_{t}^{-1/2}m_{t}))\rVert\end{align*}

and \prod _{\mathcal F, \sqrt {\hat V_{t}}}(x^{*}) = x^{*} for all x^{*} \in \mathcal F . For all 1\le t \le T , put g_{t} = \nabla _{x} f_{t}(x_{t}) . Using Lemma 2.8 with u_{1} = x_{t+1} and u_{2} = x^{*} , we have \begin{align*} \lVert \hat V_{t}^{1/4}(x_{t+1} - x^{*}) \rVert ^{2} \le&\lVert \hat V_{t}^{1/4}(x_{t} - \alpha _{t} \hat V_{t}^{-1/2}m_{t} - x^{*}) \rVert ^{2} \\=&\lVert \hat V_{t}^{1/4}(x_{t} - x^{*}) \rVert ^{2} + \alpha ^{2}_{t} \lVert \hat V_{t}^{-1/4}m_{t}\rVert ^{2} - 2\alpha _{t}\langle m_{t}, x_{t}- x^{*}\rangle \\=&\lVert \hat V_{t}^{1/4}(x_{t} - x^{*}) \rVert ^{2} + \alpha ^{2}_{t} \lVert \hat V_{t}^{-1/4}m_{t}\rVert ^{2} \\&-\,2\alpha _{t}\langle \beta _{1,t}m_{t-1} + (1-\beta _{1,t})g_{t}, x_{t}- x^{*}\rangle.\end{align*}
This yields \begin{align*} \langle g_{t}, x_{t}- x^{*}\rangle\le&\frac {1}{2\alpha _{t}(1-\beta _{1,t})}\Big [ \lVert \hat V_{t}^{1/4}(x_{t} - x^{*}) \rVert ^{2} -\lVert \hat V_{t}^{1/4}(x_{t+1} - x^{*}) \rVert ^{2} \Big] \\&+\,\frac {\alpha _{t}}{2(1-\beta _{1,t})}\lVert \hat V_{t}^{-1/4}m_{t}\rVert ^{2} -\,\frac {\beta _{1,t}}{1-\beta _{1,t}}\langle m_{t-1}, x_{t}- x^{*}\rangle.\end{align*}
Therefore, we obtain \begin{align*}&\hspace{-1.2pc}\rlap{\text{$\displaystyle \sum _{i=1}^{d} g_{t,i} (x_{t,i} - {x^{*}_{i}}) $}}\qquad \\[-2.5pt]\le&\sum _{i=1}^{d} \frac {\sqrt {\hat v_{t,i}}}{2\alpha _{t}(1-\beta _{1,t}) } \Big ((x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2}\Big) \\[-2.5pt]&+\,\sum _{i=1}^{d} \frac {\alpha _{t}}{2(1-\beta _{1,t}) } \frac { m^{2}_{t,i}}{\sqrt {\hat v_{t,i}}} - \sum _{i=1}^{d}\frac {\beta _{1,t}}{1 - \beta _{1,t}}m_{t-1,i}(x_{t,i} - {x^{*}_{i}}). \\[-2.5pt]\tag{2}\end{align*}
Moreover, by Lemma 2.2, we have f_{t}(x^{*}) -f_{t}(x_{t}) \ge g_{t}^{\text T} (x^{*}-x_{t}) , where g_{t}^{\sf T} denotes the transpose of vector g_{t} . This means that \begin{equation*} f_{t}(x_{t}) - f_{t}(x^{*}) \le g_{t}^{\sf T}(x_{t}-x^{*}) = \sum _{i=1}^{d} g_{t,i}(x_{t,i}-{x^{*}_{i}}).\end{equation*}
Hence, \begin{align*} R(T)=&\sum _{t=1}^{T} [f_{t}(x_{t}) - f_{t}(x^{*})] \\[-2.5pt]\le&\sum _{t=1}^{T}g_{t}^{\sf T} (x_{t} - x^{*}) = \sum _{t=1}^{T}\sum _{i=1}^{d}g_{t,i} (x_{t,i} - {x^{*}_{i}}).\tag{3}\end{align*}
Combining (2) with (3), we obtain \begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{2\alpha _{t}(1-\beta _{1,t}) } ((x_{t,i}\! -\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2}) \\[-2.5pt]&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t}) } \frac { m^{2}_{t,i}}{\sqrt {\hat v_{t,i}}} \\[-2.5pt]&+\,\sum _{i=1}^{d} \sum _{t=2}^{T} \frac {\beta _{1,t}}{1-\beta _{1,t}}m_{t-1,i}({x^{*}_{i}} - x_{t,i}).\end{align*}
where the last term is from the setting that m_{0} = 0 . On the other hand, for all t\ge 2 , we have \begin{align*} m_{t-1,i}({x^{*}_{i}} - x_{t,i})=&\frac {(\hat {v}_{t-1,i})^{1/4}}{\sqrt {\alpha _{t-1}}} ({x^{*}_{i}} - x_{t,i}) \sqrt {\alpha _{t-1}} \frac {m_{t-1,i}}{(\hat {v}_{t-1,i})^{1/4}} \\ \le&\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}} (x_{t,i} - {x^{*}_{i}})^{2} + {\alpha _{t-1}} \frac {m^{2}_{t-1,i}}{2\sqrt {\hat {v}_{t-1,i}}}.\end{align*}
Therefore, we obtain \begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1,t}) }\left ({(x_{t,i} \!-\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2} }\right) \\&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1-\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t})} (x_{t,i} - {x^{*}_{i}})^{2}.\tag{4}\end{align*}
Moreover, we have \begin{align*} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1\!-\!\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}}=&\sum _{i=1}^{d} \sum _{t=1}^{T-1}\frac {\beta _{1,t+1}\alpha _{t}}{2(1\!-\!\beta _{1,t+1})} \frac {m^{2}_{t,i}}{\sqrt {\hat {v}_{t,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1\!-\!\beta _{1,t+1})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}},\end{align*}
where the last inequality is from the assumption that \beta _{1,t} \le \beta _{1} < 1 (1\le t\le T) . Therefore, \begin{align*}&\hspace{-1.8pc}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} + \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1-\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}.\tag{5}\end{align*}
Hence, from (4) and (5) we have \begin{align*} R(T) \le & \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1,t}) }\left ({(x_{t,i} \!-\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2} }\right)\\ \quad &+ \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}\\ \quad & + \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2},\end{align*}
where the last term is from the property that \beta _{1,t} \le \beta _{1} (1\le t \le T) .

Issue in the convergence proof of AMSGrad. We denote the terms on the right-hand side of the upper bound for R(T) in Lemma 3.1 as \begin{align*} &\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1,t}) }\left ({(x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2} }\right),\tag{6}\\& \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}},\tag{7}\end{align*}

and \begin{equation*} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2}.\tag{8}\end{equation*}

The issue in the proof of the convergence theorem of AMSGrad [3, Theorem 4] arises when examining the term (6). Indeed, in [3, page 18], Reddi et al. used the property that \beta _{1,t} \le \beta _{1} , and hence \begin{equation*} \frac {1}{1-\beta _{1,t}} \leq \frac {1}{1-\beta _{1}},\end{equation*}

to replace all \beta _{1,t} by \beta _{1} as \begin{align*} (6)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1}) }\left ({(x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2} }\right) \\ \le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\frac {1}{2(1-\beta _{1})}\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} - {x^{*}_{i}})^{2} \left ({\frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}} - \frac {\sqrt {\hat {v}_{t-1,i}}}{\alpha _{t-1}} }\right).\end{align*}
However, the first inequality above is not guaranteed because the quantity \begin{equation*}(x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2}\end{equation*}
in (6) may be both negative and positive as shown in Counter-example 3.2. This is also a neglected issue in the convergence proofs in Kingma and Ba [2, Theorem 10.5], Luo et al. [5, Theorem 4], Bock et al. [6, Theorem 4.4], and Chen and Gu [7, Theorem 4.2].

Counter-example 3.2 (For the AMSGrad Convergence Proof): We use the function from the synthetic experiment of Reddi et al. [3, Page 6] \begin{equation*} f_{t}(x)=\begin{cases}{1010 x,} & t~{{~\text {mod }} 101=1} \\ {-10 x,} &~{{~\text {otherwise }}}\end{cases}\end{equation*}

with the constraint set \mathcal F = [-1,1] . The optimal solution is x^{*}=-1 . By the proof of [3, Theorem 1], the initial point x_{1} = 1 . By Algorithm 1, m_{0} = 0 , v_{0} = 0 , and \hat v_{0} = 0 . We choose \beta _{1} = 0.9 , \beta _{1,t} = \beta _{1}\lambda ^{t-1} , where \lambda = 0.001 , \beta _{2} = 0.999 , and \alpha _{t} = \alpha /\sqrt {t} , where \alpha = 0.001 . Under this setting, we have f_{1}(x_{1}) = 1010x_{1} , f_{2}(x_{2}) = -10x_{2} , f_{3}(x_{3}) = -10x_{3} and hence \begin{align*} g_{1}=&\nabla f_{1}(x_{1}) = 1010,\\ m_{1}=&\beta _{1,1}m_{0} + (1-\beta _{1,1})g_{1} = (1-0.9)1010 = 101,\\ v_{1}=&\beta _{2}v_{0} + (1-\beta _{2})g_{1}^{2} = (1-0.999)1010^{2} = 1020.1,\\ \hat v_{1}=&\max (\hat v_{0}, v_{1}) = v_{1}.\end{align*}
Therefore, \begin{align*} x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}=&1-(0.001)101/\sqrt {1020.1}\\=&0.9968377223398316.\end{align*}
Since x_{1} - \alpha _{1}\,\,m_{1}/\sqrt {\hat v_{1}}>0 , we have \begin{align*} x_{2}=&\prod _{\mathcal F}(x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}) \\=&\min (1, x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}) \\=&0.9968377223398316.\end{align*}
Hence, \begin{equation*} (x_{1} - x^{*})^{2} - (x_{2} - x^{*})^{2} = 4 - (1.9968377223398316)^{2} \approx 0.0126391 >0.\end{equation*}
At t=2 , we have \begin{align*} g_{2}=&-10,\\ m_{2}=&\beta _{1,2}m_{1} + (1-\beta _{1,2})g_{2} \\=&(0.9)(0.001)(101) + [1-(0.9)(0.001)](-10) \\=&-9.9001,\\ v_{2}=&\beta _{2}v_{1} + (1-\beta _{2})g_{2}^{2}\\=&(0.999)(1020.1) + (1-0.999)(-10)^{2} \\=&1019.1799000000001,\\ \hat v_{2}=&\max (\hat v_{1}, v_{2}) = v_{1}\\=&1020.1.\end{align*}
Therefore, \begin{align*} x_{2} \!-\! \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}=&0.9968377223398316 \!-\! \frac {0.001}{\sqrt {2}} \frac {(-9.9001)}{\sqrt {1020.1}} \\=&0.9970569034941291.\end{align*}
Since x_{2} - \alpha _{2}\,\,m_{2}/\sqrt {\hat v_{2}}>0 , we obtain \begin{align*} x_{3}=&\prod _{\mathcal F}(x_{2} - \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}) \\=&\min (1, x_{2} - \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}) \\=&0.9970569034941291.\end{align*}
Hence, \begin{equation*} (x_{2} - x^{*})^{2} - (x_{3} - x^{*})^{2} = -0.0008753864342319062 < 0.\end{equation*}
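The sign change above can be checked numerically. The short script below is our own reproduction of the first few AMSGrad steps of this counter-example (the hyper-parameter values are exactly those listed above); the printed values may differ from the exact ones in the last digits because of floating-point arithmetic.

import numpy as np

beta1, beta2, lam, alpha = 0.9, 0.999, 0.001, 0.001
x_star = -1.0                      # constrained optimum of the counter-example
grads = [1010.0, -10.0, -10.0]     # gradients of f_1, f_2, f_3 (independent of x)

x, m, v, v_hat = 1.0, 0.0, 0.0, 0.0
for t, g in enumerate(grads, start=1):
    beta1_t = beta1 * lam ** (t - 1)
    m = beta1_t * m + (1 - beta1_t) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = max(v_hat, v)
    x_new = float(np.clip(x - (alpha / np.sqrt(t)) * m / np.sqrt(v_hat), -1.0, 1.0))
    print(f"t={t}: (x_t - x*)^2 - (x_(t+1) - x*)^2 = {(x - x_star)**2 - (x_new - x_star)**2:+.10f}")
    x = x_new

The first printed difference is positive while the second is negative, so the bracketed quantity in (6) indeed changes sign.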

Outline of our solution. Let us rewrite (6) as \begin{align*} (6)=&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac { \sqrt {\hat {v}_{t,i}}}{2\alpha _{t}(1-\beta _{1,t})} (x_{t, i} - {x^{*}_{i}})^{2} \\&- \, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t-1})} (x_{t, i} - {x^{*}_{i}})^{2} \\&- \, \sum _{i=1}^{d}\frac {\sqrt {\hat {v}_{T,i}}}{2\alpha _{T}(1-\beta _{1,T})} (x_{T+1, i} - {x^{*}_{i}})^{2}\\\le&{\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2}} \\[-1pt]&+ \, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac { \sqrt {\hat {v}_{t,i}}}{2\alpha _{t}(1-\beta _{1,t})} (x_{t, i} - {x^{*}_{i}})^{2} \\[-1pt]&-\, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t-1})} (x_{t, i} - {x^{*}_{i}})^{2}.\end{align*}


Therefore, \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\[-1pt]&+ \,\frac {1}{2}\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \Biggl (\frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}(1-\boxed {\beta _{1,t}})} \\[-1pt]&\qquad -\, \frac {\sqrt {\hat {v}_{t-1,i}}}{\alpha _{t-1}(1\!-\!\boxed {\beta _{1,t-1}})} \Biggr), \tag{9}\end{align*}

in which the differences with Reddi et al. [3] are highlighted in the boxes, namely, \boxed {\beta _{1,t}} and \boxed {\beta _{1,t-1}} instead of \beta _{1} .

We suggest two ways to overcome these differences depending on the setting of \beta _{1,t} (1\le t \le T) :

  • In Section IV: If either \beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}\lambda ^{t-1} or \beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}/t (1\le t \le T) , where 0\le \beta _{1} < 1 and 0 < \lambda < 1 , then we give a new convergence theorem for AMSGrad.

  • In Section V: If the setting for \beta _{1,t} (1\le t \le T) is general, as in the statement of Theorem A, then we suggest a new (slightly modified) version of AMSGrad.

SECTION IV.

New Convergence Theorem for AMSGrad

When either \beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}\lambda ^{t-1} or \beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}/t (1\le t \le T) , where 0\le \beta _{1} < 1 and 0 < \lambda < 1 , Theorem A can be fixed as follows, in which the upper bounds of the regret R(T) are changed.

Theorem 4.1 (Fixes for Theorem A):

Let x_{t} and v_{t} be the sequences obtained from Algorithm 1, \alpha _{t} = \frac {\alpha }{\sqrt {t}} , either \beta _{1,t} = \beta _{1}\lambda ^{t-1} , where \lambda \in (0,1) , or \beta _{1,t} = \frac {\beta _{1}}{t} for all t\in [T] , and \gamma = \frac {\beta _{1}}{\sqrt {\beta _{2}}} \le 1 . Assume that \mathcal F has bounded diameter D_{\infty } and \lVert {\nabla f_{t}(x)}\rVert _{\infty } \le G_{\infty } for all t\in [T] and x\in \mathcal F . Then there is some 1\le t_{0} \le T such that AMSGrad (Algorithm 1) achieves the following guarantee for all T\ge 1 :\begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\&+\, \frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}} \\&+ \,\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}

provided \beta _{1,t} = \beta _{1}\lambda ^{t-1} , and \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\&+\, \frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})} \\&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
provided \beta _{1,t} = \frac {\beta _{1}}{t} .

To prove Theorem 4.1, we need the following Lemmas 4.2, 4.3, and 4.4.

Lemma 4.2:

For all t\ge 1 , \sqrt {\hat {v}_{t}} \le G_{\infty } .

Proof:

From the definition of \hat {v}_{t} in AMSGrad’s algorithm, it follows that \hat {v}_{t} = \max \{v_{1},\ldots ,v_{t}\} . Therefore, there is some 1\le s\le t such that \hat {v}_{t} = v_{s} . Hence, \begin{align*} \sqrt {\hat {v}_{t}}=&\sqrt {v_{s}}\\=&\sqrt {1-\beta _{2}}\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}g^{2}_{k}}\\ \le&\sqrt {1-\beta _{2}}\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k} (\max _{1\le j\le s}{|g_{j}|})^{2}}\\ \le&G_{\infty }\sqrt {1-\beta _{2}} \sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}}\\ \le&G_{\infty }\sqrt {1-\beta _{2}} \frac {1}{\sqrt {1-\beta _{2}}}\\=&G_{\infty },\end{align*}

where the first inequality is by the fact that |g_{k}|\le \max _{1\le j\le s}{|g_{j}|} for all k\in [s] , the second inequality is from the assumption that \lVert {g_{j}}\rVert _{\infty } \le G_{\infty } , and the last inequality is by Lemma 2.4.

Lemma 4.3:

If either \beta _{1,t} = \beta _{1}\lambda ^{t-1} or \beta _{1,t} = \beta _{1}/t , then there exists some t_{0} such that for every t > t_{0} , \begin{equation*}\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \ge \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}.\end{equation*}


Proof:

Since \hat {v}_{t,i}\ge \hat {v}_{t-1,i} owing to \hat v_{t} = \max (\hat v_{t-1}, v_{t}) in Algorithm 1, it is sufficient to prove that there exists some t_{0} such that for every t > t_{0} , \begin{equation*} \frac {\sqrt {t}}{1-\beta _{1,t}} \ge \frac {\sqrt {t-1}}{1-\beta _{1,t-1}}.\end{equation*}

In other words, \begin{equation*} 1- \frac {\beta _{1,t-1}-\beta _{1,t}}{1-\beta _{1,t}} \ge \sqrt {1-\frac {1}{t}}.\tag{10}\end{equation*}

When \beta _{1,t} = \beta _{1}/t , (10) becomes \begin{equation*} 1- \frac {\beta _{1}}{(t-1)(t-\beta _{1})} \ge \sqrt {1-\frac {1}{t}}.\tag{11}\end{equation*}


When \beta _{1,t} = \beta _{1}\lambda ^{t-1} , (10) takes the form \begin{equation*} 1- \frac {(1- \lambda) \beta _{1}\lambda ^{t-2}}{1-\beta _{1}\lambda ^{t-1}} = \frac {1-\beta _{1} \lambda ^{t-2}}{1-\beta _{1} \lambda ^{t-1}} \ge \sqrt {1 - \frac {1}{t}}.\tag{12}\end{equation*}

Since \beta _{1} and \lambda are smaller than 1, when t is sufficiently large, meaning that t > t_{0} for some t_{0} , the left-hand side of (11) is 1 - O(1/t^{2}) and the left-hand side of (12) is at least 1 - \beta _{1} \lambda ^{t-2} = 1 - O(\lambda ^{t-2}) , while the right-hand side satisfies \sqrt {1-\frac {1}{t}} \le 1 - \frac {1}{2t} . Therefore, (11) and (12) hold when t is sufficiently large.
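As a numerical sanity check of Lemma 4.3, the following sketch (ours, with illustrative values \beta _{1}=0.9 and \lambda =0.5 that are not taken from the paper) scans a finite horizon and reports the last t at which inequality (10) fails, i.e., an empirical t_{0} for both choices of \beta _{1,t} .

import math

def last_failure_of_10(beta1_fn, t_max=10_000):
    """Return the largest t <= t_max at which (10) fails (1 if it never fails).

    beta1_fn(t) must return beta_{1,t}; scanning a finite horizon is only a
    numerical check, not a proof.
    """
    t0 = 1
    for t in range(2, t_max + 1):
        b_t, b_prev = beta1_fn(t), beta1_fn(t - 1)
        lhs = 1.0 - (b_prev - b_t) / (1.0 - b_t)
        rhs = math.sqrt(1.0 - 1.0 / t)
        if lhs < rhs:      # (10) fails here, so t0 must be at least t
            t0 = t
    return t0

beta1, lam = 0.9, 0.5
print(last_failure_of_10(lambda t: beta1 * lam ** (t - 1)))  # beta_{1,t} = beta1*lam^(t-1)
print(last_failure_of_10(lambda t: beta1 / t))               # beta_{1,t} = beta1/t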

Lemma 4.4:

For the parameter settings and conditions assumed in Theorem 4.1, we have \begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T +1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}


Proof:

The proof is almost identical to that of [3, Lemma 2]. Owing to \hat v_{t} = \max (\hat v_{t-1}, v_{t}) in Algorithm 1, we have \hat {v}_{t,i} \ge v_{t,i} for all t\ge 1 . Therefore \begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {m^{2}_{t,i}}{\sqrt {t {v}_{t,i}}} \\[-1.6pt]=&\frac {\left[{\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right]^{2}}{\sqrt {(1-\beta _{2})t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}. \tag{13}\end{align*}


Moreover, by Lemma 2.3 we have \begin{align*}&\hspace{-1.5pc}\left({\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right)^{2} \\[-1.6pt]\le&\left ({\sum _{k=1}^{t}(1-\beta _{1,k})^{2}\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)}\right)\left ({\sum _{k=1}^{t}\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}^{2}}\right).\end{align*}


And hence, \begin{align*}&\hspace{-2.8pc}\left({\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right)^{2} \\[-1.6pt]&\qquad \quad \le \, \left ({\sum _{k=1}^{t}\beta _{1}^{t-k}}\right)\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}\right),\tag{14}\end{align*}

since \beta _{1,k} \le 1 and \beta _{1,k} \le \beta _{1} for all 1\le k\le T . Combining (13) and (14) we obtain \begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}}\right)\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}\right)}{\sqrt {(1-\beta _{2})t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}\\ \le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}} \frac {\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}},\end{align*}
where the last inequality is obtained by applying Lemma 2.4 to \sum _{k=1}^{t}\beta _{1}^{t-k} . Therefore, \begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}\sqrt {t}} \frac {\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}\\[-1.6pt]\le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\frac {\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {\beta _{2}^{t-k}g^{2}_{k,i}}}\\[-1.6pt]\le&\frac {1}{(1-\beta _{1}) \sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\frac {\beta _{1}^{t-k}}{\sqrt {\beta _{2}^{t-k}}} |{g_{k,i}}|\\[-1.6pt]=&\frac {1}{(1-\beta _{1}) \sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\gamma ^{t-k} |{g_{k,i}}|,\end{align*}
where the second inequality is by Lemma 2.7. Therefore \begin{equation*} \sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {1}{(1\!-\!\beta _{1})\sqrt {1\!-\!\beta _{2}}} \sum _{t=1}^{T}\frac {1}{\sqrt {t}}\sum _{k=1}^{t}{\gamma }^{t-k} |{g_{k,i}}|. \tag{15}\end{equation*}
It is sufficient to consider \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|{g_{k, i}}| . Firstly, \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|{g_{k, i}}| can be expanded as \begin{align*}&\gamma ^{0}|g_{1, i}| \\&+\,\frac {1}{\sqrt {2}} \biggl (\gamma ^{1}|g_{1, i}| + \gamma ^{0}|{g_{2, i}}| \biggr)\\&+\, \frac {1}{\sqrt {3}} \biggl (\gamma ^{2}|g_{1, i}| + \gamma ^{1}|{g_{2, i}}| + \gamma ^{0}|{g_{3, i}}|\biggr)\\&+\,\cdots \\&+\,\frac {1}{\sqrt {T}} \biggl (\gamma ^{T-1}|g_{1, i}| + \gamma ^{T-2}|{g_{2, i}}| +\ldots + \gamma ^{0}|{g_{T, i}}|\biggr).\end{align*}
Regrouping with |g_{k, i}| as the common factors, we see that \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}| is equal to \begin{align*}&|g_{1, i}| \left({\gamma ^{0} + \frac {1}{\sqrt {2}}\gamma ^{1} + \frac {1}{\sqrt {3}}\gamma ^{2} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-1}}\right) \\&+\, |{g_{2, i}}| \left({\frac {1}{\sqrt {2}}\gamma ^{0} + \frac {1}{\sqrt {3}}\gamma ^{1} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-2}}\right)\\&+\,|{g_{3, i}}| \left({\frac {1}{\sqrt {3}}\gamma ^{0} + \frac {1}{\sqrt {4}}\gamma ^{1} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-3}}\right)\\&+\, \cdots \\&+\,|{g_{T, i}}| \frac {1}{\sqrt {T}}\gamma ^{0}.\end{align*}
In other words, \begin{equation*}\sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}| = \sum _{t=1}^{T} |{g_{t, i}}| \sum _{k=t}^{T}\frac {1}{\sqrt {k}}\gamma ^{k-t}\end{equation*}
Moreover, since \sum _{k=t}^{T}\frac {1}{\sqrt {k}}\gamma ^{k-t} \le \sum _{k=t}^{T}\frac {1}{\sqrt {t}}\gamma ^{k-t} = \frac {1}{\sqrt {t}}\sum _{k=t}^{T}\gamma ^{k-t} = \frac {1}{\sqrt {t}}\sum _{k=0}^{T-t}\gamma ^{k} \le \frac {1}{\sqrt {t}}\left ({\frac {1}{1-\gamma }}\right) , where the last inequality is by Lemma 2.4, we obtain \begin{align*} \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}|\le&\sum _{t=1}^{T}|g_{t, i}| \frac {1}{\sqrt {t}}\left ({\frac {1}{1-\gamma }}\right) \\=&\frac {1}{1-\gamma } \sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|.\end{align*}
Furthermore, since \begin{align*} \sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|=&\sqrt {\left ({\sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|}\right)^{2}}\\\le&\sqrt {\sum _{t=1}^{T} \frac {1}{t}} \sqrt {\sum _{t=1}^{T}g_{t, i}^{2}} \\\le&(\sqrt {\ln T +1})\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
where the first inequality is by Lemma 2.3 and the last inequality is by Lemma 2.5, we obtain \begin{equation*}\sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}| \le \frac {\sqrt {\ln T +1}}{1-\gamma } \lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}
Hence, by (15), \begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T + 1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2},\end{equation*}
which ends the proof.

Let us now prove Theorem 4.1.

Proof of Theorem 4.1:

To prove Theorem 4.1, by Lemma 3.1, we need to bound the terms (6), (7), and (8). First, we consider (7). We have \begin{align*} \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}=&\frac {\alpha }{1-\beta _{1}}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac { m_{t,i}^{2}}{\sqrt {t\hat v_{t,i}}} \\[4.5pt]\le&\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \\[4.5pt]&\times \, \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}, \tag{16}\end{align*}

where the equality is by the assumption that \alpha _{t} = \alpha /\sqrt {t} and the last inequality is by Lemma 4.4. Next, we consider (8). The bound for (8) depends on either \beta _{1,t} = \beta _{1}\lambda ^{t-1} (0 < \lambda < 1) or \beta _{1,t} = \frac {\beta _{1}}{t} . Recall that by assumption, \lVert {x_{m}-x_{n}}\rVert _{\infty } \le D_{\infty } for any m,n\in \{1,2,\ldots ,T\} , \alpha _{t} = \alpha /\sqrt {t} . If \beta _{1,t} = \beta _{1}\lambda ^{t-1} (0 < \lambda < 1) , then \begin{align*}&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2} \\=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1}\lambda ^{t-1}\sqrt {(t-1)}\sqrt {\hat {v}_{t-1,i}} }{2\alpha (1-\beta _{1})}(x_{t,i} - {x^{*}_{i}})^{2} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\sqrt {(t-1)} \lambda ^{t-1} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}t \lambda ^{t-1} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \frac {1}{(1-\lambda)^{2}} \\=&\frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}}, \tag{17}\end{align*}
where the first inequality is from Lemma 4.2 and the assumption that \beta _{1}\le 1 , and the last inequality is by Lemma 2.4. If \beta _{1,t} = \frac {\beta _{1}}{t} , then \begin{align*}&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2} \\=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1}\sqrt {(t-1)}\sqrt {\hat {v}_{t-1,i}} }{2\alpha (1-\beta _{1})t}(x_{t,i} - {x^{*}_{i}})^{2} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {(t-1)}}{t} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {1}{\sqrt {t}} \\ \le&\frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})}, \tag{18}\end{align*}
where the first inequality is from Lemma 4.2 and the assumption that \beta _{1}\le 1 , and the last inequality is by Lemma 2.6.

Finally, we will bound (6). From (9) and replacing \alpha _{t} with \frac {\alpha }{\sqrt {t}} (1\le t \le T) , we obtain \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha (1-\beta _{1})} (x_{1, i} \!-\! {x^{*}_{i}})^{2} \\&+\,\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1\!-\!\beta _{1,t}} \!-\! \frac {\sqrt {(t\!-\!1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right).\end{align*}


By Lemma 4.3, there is some t_{0} (1\le t_{0} \le T) such that \frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \ge \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} for all t>t_{0} . Therefore, \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\[-1.5pt]&+\,\!\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}}\! -\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\[-1.5pt]&+\,\!\frac {1}{2\alpha }\!\!\sum _{i=1}^{d}\!\! \sum _{t=t_{0}+1}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}}\right)\\[-1.5pt]\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} \\[-1.5pt]&+\,\!\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\[-1.5pt]&+\,\!\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=t_{0}+1}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right).\end{align*}


Since \begin{align*}&\hspace{-2.8pc}\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=t_{0}+1}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\=&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} - \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {t_{0}\hat {v}_{t_{0},i}}}{1-\beta _{1, t_{0}}}\\\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}},\end{align*}

we have \begin{align*} (6)\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} + \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d}\frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} \\&+\, \frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} - {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right) \\ \le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} + \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d}\frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} \\&+\, \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} \frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \\ \le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t}+ \sqrt {T}}\right),\tag{19}\end{align*}
where the second inequality is obtained by omitting the term \frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} - {x^{*}_{i}})^{2} \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} , and the last inequality is by Lemma 4.2 and the assumption that \beta _{1,t}\le \beta _{1} (1\le t \le T) . Summing up, if \beta _{1,t} = \beta _{1}\lambda ^{t-1} , then, from (16), (17), and (19), we obtain \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\[-1.5pt]&+\, \frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}} \\[-1.5pt]&+\,\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}

If \beta _{1,t} = \frac {\beta _{1}}{t} , then, from (16), (18), and (19), we obtain \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\[-1.5pt]&+\, \frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})} \\[-1.5pt]&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}

which ends the proof.

The following corollary shows that, when either \beta _{1,t} = \beta _{1}\lambda ^{t-1} or \beta _{1,t}= \beta _{1}/t (1\le t \le T) , where 0\le \beta _{1} < 1 and 0 < \lambda < 1 , the average regret R(T)/T of AMSGrad converges to zero.

Corollary 4.5:

With the same assumption as in Theorem 4.1, AMSGrad achieves the following guarantee:\begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T} = 0.\end{equation*}


Proof:

The result is obtained by using Theorem 4.1 and the following fact:\begin{align*} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}=&\sum _{i=1}^{d}\sqrt {g_{1,i}^{2} + g_{2,i}^{2}+\cdots + g_{T,i}^{2}}\\ \le&\sum _{i=1}^{d}\sqrt {TG_{\infty }^{2}}\\=&dG_{\infty }\sqrt {T},\end{align*}

where the inequality is from the assumption that \lVert {g_{t}}\rVert _{\infty } \le G_{\infty } for all t\in [T] . Consequently, every term in the upper bound of Theorem 4.1 is O(\sqrt {T\ln T}) , and hence R(T)/T \to 0 as T\to \infty .

SECTION V.

New Version of AMSGrad Optimizer: AdamX

Let f_{1}, f_{2},\ldots , f_{T}: \mathcal F \to \mathbb R be an arbitrary sequence of convex cost functions. If the sequence \{\beta _{1,t}\}_{1\le t \le T} is kept arbitrary, as in the setting of Theorem A, then to ensure that the regret R(T) satisfies R(T)/T\to 0 , we suggest a new algorithm as follows.

With this Algorithm 2, the regret is bounded as follows.

Algorithm 2

AdamX: A New Variant of Adam and AMSGrad

Input:

x_{1}\in \mathbb R^{d} , step size \{\alpha _{t}\}_{t=1}^{T}, \{\beta _{1,t}\}_{t=1}^{T}, \beta _{2}

Set m_{0} = 0, v_{0} = 0 , and \hat v_{0} = 0

for (t=1; t\le T; t\gets t+1) do

g_{t} = \nabla f_{t}(x_{t})

m_{t} = \beta _{1,t}\cdot m_{t-1} + (1-\beta _{1,t})\cdot g_{t}

v_{t} = \beta _{2}\cdot v_{t-1} + (1-\beta _{2})\cdot g^{2}_{t}

\begin{equation*} \hat v_{t} = \begin{cases} v_{t} & \text {if } t = 1\\ \max \left\{{\dfrac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1}, v_{t}}\right\} & \text {if } t\ge 2 \end{cases}\end{equation*}


x_{t+1} = \prod _{\mathcal F, \sqrt {\hat V_{t}}}(x_{t} - \alpha _{t} \cdot m_{t}/\sqrt {\hat v_{t}}) ,

where \hat V_{t} = \text {diag}(\hat v_{t})

end for

Output:

x_{T+1}
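As with Algorithm 1, the following NumPy sketch is our own illustration of Algorithm 2 (names such as adamx and grad_fns are ours, and eps is an added stabilizing constant); the only difference from the AMSGrad sketch after Algorithm 1 is the rescaling factor ((1-\beta _{1,t})/(1-\beta _{1,t-1}))^{2} applied to \hat v_{t-1} inside the max.

import numpy as np

def adamx(grad_fns, x1, alpha=0.001, beta1=0.9, beta2=0.999,
          lam=0.001, lo=-1.0, hi=1.0, eps=1e-8):
    """Minimal AdamX sketch (Algorithm 2) on the box [lo, hi]^d,
    with beta_{1,t} = beta1 * lam**(t-1)."""
    x = np.asarray(x1, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    v_hat = np.zeros_like(x)
    beta1_prev = beta1
    for t, grad in enumerate(grad_fns, start=1):
        g = grad(x)
        beta1_t = beta1 * lam ** (t - 1)
        m = beta1_t * m + (1.0 - beta1_t) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        if t == 1:
            v_hat = v.copy()
        else:
            scale = ((1.0 - beta1_t) / (1.0 - beta1_prev)) ** 2
            v_hat = np.maximum(scale * v_hat, v)     # AdamX update of v_hat
        x = np.clip(x - (alpha / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps), lo, hi)
        beta1_prev = beta1_t
    return x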

Theorem 5.1:

Let x_{t} and v_{t} be the sequences obtained from Algorithm 2, \alpha _{t} = \frac {\alpha }{\sqrt {t}} , \beta _{1} = \beta _{1,1} , \beta _{1,t} \le \beta _{1} for all t\in [T] and \gamma = \frac {\beta _{1}}{\sqrt {\beta _{2}}} \le 1 . Assume that \mathcal F has bounded diameter D_{\infty } and \lVert {\nabla f_{t}(x)}\rVert _{\infty } \le G_{\infty } for all t\in [T] and x\in \mathcal F . For x_{t} generated using AdamX (Algorithm 2), we have the following bound on the regret:\begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}}\sqrt {T} \\&+\, \frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)} \\&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}


To prove Theorem 5.1, we need the following Lemmas 5.2, 5.3, and 5.4.

Lemma 5.2:

For all t\ge 1 , we have \begin{equation*} \hat {v}_{t} = \max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le t},\tag{20}\end{equation*}

where \hat {v}_{t} is in Algorithm 2.

Proof:

We will prove (20) by induction on t . Recall that by the update rule on \hat {v}_{t} , we have \hat v_{1} \mathop{=}\limits^{\Delta } v_{1} and \hat v_{t} \mathop{=}\limits^{\Delta } \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1}, v_{t}}\right\} if t\ge 2 . Therefore, \begin{align*} \hat v_{2}\overset{\Delta }{=}&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,1})^{2}}\hat v_{1}, v_{2}}\right \}\\=&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,1})^{2}} v_{1}, v_{2}}\right \}\\=&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le 2}.\end{align*}

Assume that (20) holds up to t-1 , in particular \begin{equation*}\hat {v}_{t-1} = \max \left \{{\frac {(1-\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le t-1}.\end{equation*}
Since \begin{equation*}\hat {v}_{t}\overset {\Delta }{=} \max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat {v}_{t-1}, v_{t}}\right \},\end{equation*}
we have \begin{align*} \hat {v}_{t}=&\max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\left ({\max \left\{{\frac {(1-\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}{v}_{s}}\right\}_{1\le s\le t-1}}\right), v_{t}}\right \} \\=&\max \left\{{ \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\frac {(1-\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}{v}_{s}}\right\}_{1\le s\le t-1}, v_{t}}\right\} \\=&\max \left\{{ \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}} {v}_{s}}\right\}_{1\le s\le t-1}, \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t})^{2}}v_{t}}\right\} \\=&\max \left\{{ \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}} {v}_{s}}\right\}_{1\le s\le t},\end{align*}
which ends the proof.

Lemma 5.3:

For all t\ge 1 , we have \sqrt {\hat {v}_{t}} \le \frac {G_{\infty }}{1-\beta _{1}} , where \hat {v}_{t} is in Algorithm 2.

Proof:

By Lemma 5.2, \begin{equation*}\hat {v}_{t} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right\}_{1\le s\le t}.\end{equation*}

Therefore, there is some s such that 1\le s\le t and \hat {v}_{t} = \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s} . Hence, \begin{align*} \sqrt {\hat {v}_{t}}=&\sqrt {\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}} \\=&\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}g^{2}_{k}}\\ \le&\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k} (\max _{1\le j\le s}{|g_{j}|})^{2}}\\ \le&G_{\infty }\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}}\\ \le&G_{\infty }\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\frac {1}{\sqrt {1-\beta _{2}}}\\=&\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)G_{\infty }\\ \le&\frac {G_{\infty }}{1-\beta _{1}},\end{align*}
which ends the proof.

Lemma 5.4:

For the parameter settings and conditions assumed in Theorem 5.1, we have \begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T +1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}


Proof:

Since, for all t\ge 1 , \begin{equation*}\hat {v}_{t,i} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s,i}}\right\}_{1\le s\le t}\end{equation*}

by Lemma 5.2, we have \hat {v}_{t,i} \ge v_{t,i} , and hence the proof is the same as that of Lemma 4.4.

Proof of Theorem 5.1:

Similarly to the proof of Theorem 4.1, we need to bound (6), (7), and (8). By using Lemma 5.4, we obtain the same bound for (7) as in the proof of Theorem 4.1, that is, \begin{align*} (7)=&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\=&\frac {\alpha }{1-\beta _{1}}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac { m_{t,i}^{2}}{\sqrt {t\hat v_{t,i}}}\\\le&\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}

where the last inequality is by Lemma 5.4. Now we bound (8). By the assumption that \lVert {x_{m}-x_{n}}\rVert _{\infty } \le D_{\infty } for any m,n\in \{1,\ldots ,T\} , \alpha _{t} = \alpha /\sqrt {t} , and \beta _{1,t} \le \beta _{1} < 1 , we obtain \begin{align*} (8)=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2}\\ \le&\frac {D_{\infty }^{2}}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)\hat {v}_{t-1,i}}.\end{align*}
Therefore, from Lemma 5.3, we obtain \begin{equation*} (8) \le \frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)}.\end{equation*}
Finally, we will bound (6). By (9) and replacing \alpha _{t} = \frac {\alpha }{\sqrt {t}} (1\le t\le T) , we obtain \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha (1-\beta _{1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\! \frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} - {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\end{align*}
Moreover, by the update rule of Algorithm 2, we have \begin{equation*}\hat v_{t,i} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1,i}, v_{t,i}}\right\}.\end{equation*}
Therefore, \hat v_{t,i} \ge \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1,i} , and hence \begin{align*}&\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} \\ \ge&\frac {\sqrt {t\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}\\=&\frac {\sqrt {t \hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} \ge 0.\end{align*}
Now, by the nonnegativity of the quantity \begin{equation*}\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}},\end{equation*}
we obtain \begin{align*} (6)\le&\frac {D_{\infty }^{2} }{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1}} \\+&\frac {D_{\infty }^{2} }{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\=&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac {\sqrt {T\hat {v}_{T,i}} }{1-\beta _{1,T}}\\\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}}\sqrt {T}~,\end{align*}
where the last inequality is by Lemma 5.3. Hence we obtain the desired upper bound for R(T) .

Corollary 5.5:

With the same assumption as in Theorem 5.1, and for all 0 \le \beta _{1,t} < 1 satisfying \begin{equation*}\lim _{T\to \infty }\frac {\sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}}{T} = 0,\end{equation*}

AdamX achieves the following guarantee:\begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T}=0.\end{equation*}

Proof:

By Theorem 5.1, it is sufficient to consider the term \begin{equation*}\frac {dD_{\infty }^{2}~G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}\end{equation*}

on the right-hand side of the upper bound for R(T) in Theorem 5.1, since the other terms are O(\sqrt {T}) and O(\sqrt {T\ln T}) and hence vanish after dividing by T . Because the factor \frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}} is bounded and does not depend on T , the statement follows from the hypothesis on \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1} .

When either \beta _{1,t} = \beta _{1}\lambda ^{t-1} for some \lambda \in (0,1) , or \beta _{1,t} = \frac {1}{t} in Theorem 5.1, we obtain the following guarantee that the average regret of AdamX converges to zero.

Corollary 5.6:

With the same assumption as in Theorem 5.1, and either \beta _{1,t} = \beta _{1}\lambda ^{t-1} for some \lambda \in (0,1) , or \beta _{1,t} = \frac {1}{t} , AdamX achieves the following guarantee:\begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T} = 0.\end{equation*}


Proof:

By Corollary 5.5, it is sufficient to consider the term \begin{equation*}\sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}.\end{equation*}

When \beta _{1,t} = \beta _{1}\lambda ^{t-1} for some \lambda \in (0,1) , we have \begin{align*} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}=&\sum _{t=2}^{T}\beta _{1}\lambda ^{t-1}\sqrt {t-1} \\ \le&\sum _{t=2}^{T}\sqrt {(t-1)} \lambda ^{t-1} \\ \le&\sum _{t=2}^{T}t \lambda ^{t-1} \\ \le&\frac {1}{(1-\lambda)^{2}},\tag{21}\end{align*}
where the first inequality is from the property that \beta _{1} \le 1 , and the last inequality is from Lemma 2.4. When \beta _{1,t} = \frac {1}{t} , we obtain \begin{align*} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}=&\sum _{t=2}^{T}\frac {\sqrt {t-1}}{t} \\\le&\sum _{t=2}^{T}\frac {1}{\sqrt {t}} \\\le&2\sqrt {T},\tag{22}\end{align*}
where the last inequality is from Lemma 2.6. Now, by combining (21) and (22) with Corollary 5.5, we obtain the desired result.

SECTION VI.

Experiments

While we consider our main contributions to be the theoretical analyses of AMSGrad and AdamX in the previous sections, we provide experimental results for AMSGrad and AdamX in this section. Concretely, we use the PyTorch implementation of AMSGrad, obtained by setting the Boolean flag amsgrad = True. The code for AdamX is based on that of AMSGrad, with the corresponding modifications as in Algorithm 2. The parameters for AMSGrad and AdamX are identical in our experiments, namely (\beta _{1}, \beta _{2})=(0.9, 0.999) , the term added to the denominator to improve numerical stability is \epsilon = 10^{-8} , and additionally we set \beta _{1,t} = \beta _{1}\lambda ^{t-1} with \lambda =0.001 to make use of Corollary 5.6 on the convergence of AdamX.
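For concreteness, a minimal PyTorch setup consistent with these hyper-parameters might look as follows; this is only a sketch on our side. The built-in torch.optim.Adam with amsgrad=True uses a constant \beta _{1} , so the decaying \beta _{1,t} and AdamX itself require a custom optimizer (for example along the lines of the NumPy sketch after Algorithm 2), and torchvision's resnet18 is used here merely as a stand-in for the CIFAR-specific ResNet18 of [8].

import torch
import torchvision.models as models

model = models.resnet18(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, amsgrad=True)

def lr_for_epoch(epoch):
    # Stepwise schedule from the text: 1e-3, 1e-4, 1e-5, 1e-6, 1e-6/2.
    if epoch <= 80:
        return 1e-3
    if epoch <= 120:
        return 1e-4
    if epoch <= 160:
        return 1e-5
    if epoch <= 180:
        return 1e-6
    return 1e-6 / 2

for epoch in range(1, 201):
    for group in optimizer.param_groups:
        group["lr"] = lr_for_epoch(epoch)
    # ... one training epoch over CIFAR-10 with batch size 128 goes here ...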

The learning rate is scheduled identically for both optimizers AMSGrad and AdamX as follows: 10^{-3} , 10^{-4} , 10^{-5} , 10^{-6} , and 10^{-6}/2 when the epoch is in the ranges [0, 80], [81, 120], [121, 160], [161, 180], and [181, 200], respectively. We use CIFAR-10 (containing 50000 training images and 10000 test images of size 32\times 32 ) as the dataset and the residual networks ResNet18 [8] and PreActResNet18 [9] for training, with batch size 128. The testing results are given in Figure 1, where one can see that AMSGrad and AdamX behave similarly, which supports our theoretical results on the convergence of both AMSGrad (Section IV) and AdamX (Section V).

FIGURE 1. Testing accuracies over CIFAR-10 using AMSGrad and AdamX, with different neural network models.

SECTION VII.

Conclusion

We have shown that the convergence proof of AMSGrad [3] is problematic, and presented various fixes for it, which include a new and slightly modified version called AdamX. Our work helps ensure the theoretical foundation of those optimizers.
