
On the Convergence Proof of AMSGrad and a New Version


Abstract:

The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular optimizer in the training of deep neural networks. However, Reddi et al. have recently shown that the convergence proof of Adam is problematic, and they have also proposed a variant of Adam called AMSGrad as a fix. In this paper, we show that the convergence proof of AMSGrad is also problematic. Concretely, the problem in the convergence proof of AMSGrad lies in its handling of the hyper-parameters: they are treated as equal while they are not. This is also a neglected issue in the convergence proof of Adam. We provide an explicit counter-example in a simple convex optimization setting to illustrate this issue. Depending on how the hyper-parameters are handled, we present various fixes for this issue. We provide a new convergence proof for AMSGrad as the first fix. We also propose a new version of AMSGrad called AdamX as another fix. Our experiments on a benchmark dataset also support our theoretical results.
Published in: IEEE Access ( Volume: 7)
Page(s): 61706 - 61716
Date of Publication: 13 May 2019
Electronic ISSN: 2169-3536

SECTION I.

Introduction and Our Contributions

One of the most popular algorithms for training deep neural networks is stochastic gradient descent (SGD) [1] and its variants. Among the various variants of SGD, the algorithm with the adaptive moment estimation Adam [2] is widely used in practice. However, Reddi et al. [3] have recently shown that the convergence proof of Adam is problematic and proposed a variant of Adam called AMSGrad to solve this issue.

Our contribution. In this paper, we point out a flaw in the convergence proof of AMSGrad, recalled as Theorem A below. We then fix this flaw by providing a new convergence proof for AMSGrad in the case of special parameters. In addition, in the case of general parameters, we propose a new and slightly modified version of AMSGrad.

To provide more details, let us recall AMSGrad in Algorithm 1; the mathematical notation used there is fully defined in Section II.

Algorithm 1

AMSGrad (Reddi et al. [3])

Input:

x_{1}\in \mathcal F , step size \{\alpha _{t}\}_{t=1}^{T}, \{\beta _{1,t}\}_{t=1}^{T}, \beta _{2}

Set m_{0} = 0, v_{0} = 0 , and \hat v_{0} = 0

for (t=1; t\le T; t\gets t+1) do

g_{t} = \nabla f_{t}(x_{t})

m_{t} = \beta _{1,t}\cdot m_{t-1} + (1-\beta _{1,t})\cdot g_{t}

v_{t} = \beta _{2}\cdot v_{t-1} + (1-\beta _{2})\cdot g^{2}_{t}

\hat v_{t} = \max (\hat v_{t-1}, v_{t}) and \hat V_{t} = \text {diag}(\hat v_{t})

x_{t+1} = \prod _{\mathcal F, \sqrt {\hat V_{t}}}(x_{t} - \alpha _{t} \cdot m_{t}/\sqrt {\hat v_{t}})

end for
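For readers who prefer code, the following is a minimal NumPy sketch of Algorithm 1 on a box-constrained problem; it is our own illustration, not the authors' code. The helper names (amsgrad, grad_fns, lo, hi) and the small stabilizing constant eps are assumptions on our part, and for a box \mathcal F the weighted projection \prod _{\mathcal F, \sqrt {\hat V_{t}}} reduces to coordinate-wise clipping.

import numpy as np

def amsgrad(grad_fns, x1, alpha=0.001, beta1=0.9, beta2=0.999,
            lam=1.0, lo=-1.0, hi=1.0, eps=1e-8):
    """Minimal AMSGrad sketch (Algorithm 1) on the box [lo, hi]^d.

    grad_fns[t-1](x) returns the gradient of f_t at x, and
    beta_{1,t} = beta1 * lam**(t-1) (lam=1.0 gives a constant beta1).
    """
    x = np.asarray(x1, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    v_hat = np.zeros_like(x)
    iterates = [x.copy()]
    for t, grad in enumerate(grad_fns, start=1):
        g = grad(x)
        beta1_t = beta1 * lam ** (t - 1)
        m = beta1_t * m + (1.0 - beta1_t) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g ** 2       # second-moment estimate
        v_hat = np.maximum(v_hat, v)                 # the AMSGrad max step
        x = x - (alpha / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps)
        x = np.clip(x, lo, hi)                       # projection onto the box
        iterates.append(x.copy())
    return iterates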

The main theorem for the convergence of AMSGrad in [3] is as follows. To simplify the notation, we define g_{t} \mathop{=}\limits^{\Delta } \nabla f_{t}(x_{t}) , g_{t,i} as the i^{\text {th}} element of g_{t} and g_{1:t,i} \in \mathbb R^{t} as a vector that contains the i^{\text {th}} dimension of the gradients over all iterations up to t , namely, g_{1:t,i} = [g_{1,i}, g_{2,i},\ldots ,g_{t,i}] .

Theorem A [Theorem 4 in [3], problematic]:

Let x_{t} and v_{t} be the sequences obtained from Algorithm 1, \alpha _{t} = \frac {\alpha }{\sqrt {t}} , \beta _{1} = \beta _{1,1} , \beta _{1,t} \le \beta _{1} for all t\in [T] and \gamma = \frac {\beta _{1}}{\sqrt {\beta _{2}}} \le 1 . Assume that \mathcal F has bounded diameter D_{\infty } and \lVert {\nabla f_{t}(x)}\rVert _{\infty } \le G_{\infty } for all t\in [T] and x\in \mathcal F . For x_{t} generated using AMSGrad (Algorithm 1), we have the following bound on the regret:\begin{align*} R(T)\le&\frac {D_{\infty }^{2}\sqrt {T}}{\alpha (1-\beta _{1})}\sum \limits _{i=1}^{d} \sqrt {\hat {v}_{T,i}} \\&+\,\frac {D_{\infty }^{2}}{2(1-\beta _{1})} \sum \limits _{i=1}^{d} \sum _{t=1}^{T}\frac {\beta _{1,t}\sqrt {\hat v_{t,i}}}{\alpha _{t}} \\&+\,\frac {\alpha \sqrt { 1+\ln T}}{(1-\beta _{1})^{2}(1-\gamma)\sqrt {1-\beta _{2}}} \sum \limits _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}


In their proof for Theorem A, Reddi et al. resolved an issue with the so-called telescoping sum in the convergence proof of Adam ([2, Theorem 10.5]). Specifically, Reddi et al. adjusted \hat v_{t} such that \begin{equation*} {\frac {\sqrt {\hat v_{t+1,i}}}{\alpha _{t+1}} \ge \frac {\sqrt {\hat v_{t,i}}}{\alpha _{t}} }\tag{1}\end{equation*}

for all i \in [d] . However, there is another issue (shown in Section III) in the convergence proof of Adam that AMSGrad unfortunately neglects. The issue affects both the correctness of Reddi et al.’s proof and the upper bound for the regret in Theorem A. To deal with the issue in a general way, we propose to modify Algorithm 1 such that \begin{equation*} {\frac {\sqrt {\hat {v}_{t+1,i}}}{\alpha _{t+1}(1-\boxed {\beta _{1, t+1}})} \ge \frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}(1-\boxed {\beta _{1, t}})}}\end{equation*}
for all i \in [d] . The differences with (1) are highlighted in the boxes for clarity.

Paper roadmap. We begin with preliminaries in Section II. We show where the proof of Theorem A becomes invalid in Section III. After that, we suggest two ways to resolve the issue in Sections IV and V.

SECTION II.

Preliminaries

Notation. Given a sequence of vectors \{x_{t}\}_{1\le t\le T} (1\le T\in \mathbb N) in \mathbb R^{d} , we denote its i^{\text {th}} coordinate by x_{t,i} and use x_{t}^{k} to denote the elementwise power of k and \lVert {x_{t}}\lVert _{2} , resp. \lVert {x_{t}}\lVert _{\infty } , to denote its \ell _{2} -norm, resp. \ell _{\infty } -norm. Let \mathcal F \subseteq \mathbb R^{d} be a feasible set of points such that \mathcal F has bounded diameter D_{\infty } , that is, \lVert {x-y}\rVert _{\infty } \le D_{\infty } for all x,y\in \mathcal F , and \mathcal S^{d}_{+} denote the set of all positive definite d\times d matrices. For a matrix A\in \mathcal S^{d}_{+} , we denote A^{1/2} for the square root of A . The projection operation \prod _{\mathcal F, A} (y) for A \in \mathcal S^{d}_{+} is defined as \mathrm {argmin}_{ x\in \mathcal F}\lVert {A^{1/2} (x-y)}\lVert _{2} for all y \in \mathbb R^{d} . When d=1 and \mathcal F \subset \mathbb R , the positive definite matrix A is a positive number, so that the projection \prod _{\mathcal F, A} (y) becomes \mathrm {argmin}_{x\in \mathcal F}|x-y| . We use \langle x, y \rangle to denote the inner product between x and y\in \mathbb R^{d} . The gradient of a function f evaluated at x\in \mathbb R^{d} is denoted by \nabla f(x) . For vectors x, y\in \mathbb R^{d} , we use \sqrt {x} or x^{1/2} for element-wise square root, x^{2} for element-wise square, x/y to denote element-wise division. For an integer n\in \mathbb N , we denote by [n] the set of integers \{1,2,\ldots ,n\} .

Optimization setup. Let f_{1}, f_{2},\ldots , f_{T}: \mathbb R^{d} \to \mathbb R be an arbitrary sequence of convex cost functions and x_{1}\in \mathbb R^{d} . At each time t\ge 1 , the goal is to predict the parameter x_{t} and evaluate it on a previously unknown cost function f_{t} . Since the nature of the sequence is unknown in advance, the algorithm is evaluated by its regret, that is, the sum over all previous steps of the difference between the online prediction f_{t}(x_{t}) and the cost f_{t}(x^{*}) at the best fixed parameter x^{*} from a feasible set \mathcal F . Concretely, the regret is defined as \begin{equation*} R(T) = \sum _{t=1}^{T}[f_{t}(x_{t}) -f_{t}(x^{*})],\end{equation*}

where x^{*} = \mathrm {argmin}_{x\in \mathcal F}\sum _{t=1}^{T}f_{t}(x) .

Definition 2.1:

A function f: \mathbb R^{d} \rightarrow \mathbb R is convex if for all x, y\in \mathbb R^{d} , and all \lambda \in [{0,1}] , \begin{equation*} \lambda f(x) + (1-\lambda)f(y) \ge f(\lambda x + (1-\lambda)y).\end{equation*}


Lemma 2.2:

If a function f: \mathbb R^{d} \rightarrow \mathbb R is convex, then for all x, y\in \mathbb R^{d} , \begin{equation*} f(y) \ge f(x) + \nabla f(x)^{\sf T}(y-x),\end{equation*}

where \nabla f(x)^{\sf T} denotes the transpose of \nabla f(x) .

Lemma 2.3 (Cauchy–Schwarz inequality):

For all n\ge 1 , u_{i}, v_{i}\in \mathbb R (1\le i \le n) , \begin{equation*} \left ({\sum _{i=1}^{n} u_{i}v_{i}}\right)^{2} \le \left ({\sum _{i=1}^{n}u_{i}^{2}}\right) \left ({\sum _{i=1}^{n}v_{i}^{2}}\right).\end{equation*}


Lemma 2.4 (Taylor series):

For \alpha \in \mathbb R and 0 < \alpha < 1 , \begin{equation*} \sum _{t\ge 0}{\alpha ^{t}} = \frac {1}{1-\alpha }\end{equation*}

and \begin{equation*} \sum _{t\ge 1}{t\alpha ^{t-1}} = \frac {1}{(1-\alpha)^{2}}.\end{equation*}

Lemma 2.5 (Upper bound for the harmonic series):

For N\in \mathbb N , \begin{equation*} \sum _{n=1}^{N} \frac {1}{n}\le \ln N +1.\end{equation*}


Lemma 2.6:

For N\in \mathbb N , \begin{equation*} \sum _{n=1}^{N} \frac {1}{\sqrt {n}}\le 2\sqrt {N}.\end{equation*}


Lemma 2.7:

For all n\in \mathbb N and a_{i}, b_{i} \in \mathbb R such that a_{i}\ge 0 and b_{i}>0 for all i\in [n] , \begin{equation*} \frac {\sum _{i=1}^{n}a_{i}}{\sum _{j=1}^{n} b_{j}} \le \sum _{i=1}^{n}\frac {a_{i}}{b_{i}}.\end{equation*}


Lemma 2.8:

[4, Lemma 3 in arXiv version] For any Q \in \mathcal S^{d}_{+} and convex feasible set \mathcal F\subseteq \mathbb R^{d} , suppose u_{1} = \mathrm {argmin}_{x\in \mathcal F}\lVert Q ^{1/2}(x-z_{1})\rVert and u_{2} = \mathrm {argmin}_{x\in \mathcal F}\lVert Q ^{1/2}(x-z_{2})\rVert . Then, we have \begin{equation*} \lVert Q ^{1/2}(u_{1}-u_{2})\rVert \le \lVert Q ^{1/2}(z_{1}-z_{2})\rVert.\end{equation*}


SECTION III.

Issue in the Convergence Proof of AMSGrad

Before showing the issue in the convergence proof of AMSGrad, let us recall and prove the following inequality, which also appears in [3].

Lemma 3.1:

Algorithm 1 achieves the following guarantee, for all T\ge 1 :\begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}((x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2})}{ 2\alpha _{t}(1-\beta _{1,t})} \\&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2}.\end{align*}


Proof:

We note that \begin{align*} x_{t+1}=&\prod _{\mathcal F, \sqrt {\hat V_{t}}}(x_{t} - \alpha _{t} \cdot {\hat V^{-1/2}_{t}}m_{t}) \\=&\mathrm {argmin}_{x\in \mathcal F}\lVert \hat V_{t}^{1/4}(x-(x_{t} - \alpha _{t} \hat V_{t}^{-1/2}m_{t}))\rVert\end{align*}

and \prod _{\mathcal F, \sqrt {\hat V_{t}}}(x^{*}) = x^{*} for all x^{*} \in \mathcal F . For all 1\le t \le T , put g_{t} = \nabla _{x} f_{t}(x_{t}) . Using Lemma 2.8 with u_{1} = x_{t+1} and u_{2} = x^{*} , we have \begin{align*} \lVert \hat V_{t}^{1/4}(x_{t+1} - x^{*}) \rVert ^{2} \le&\lVert \hat V_{t}^{1/4}(x_{t} - \alpha _{t} \hat V_{t}^{-1/2}m_{t} - x^{*}) \rVert ^{2} \\=&\lVert \hat V_{t}^{1/4}(x_{t} - x^{*}) \rVert ^{2} + \alpha ^{2}_{t} \lVert \hat V_{t}^{-1/4}m_{t}\rVert ^{2} - 2\alpha _{t}\langle m_{t}, x_{t}- x^{*}\rangle \\=&\lVert \hat V_{t}^{1/4}(x_{t} - x^{*}) \rVert ^{2} + \alpha ^{2}_{t} \lVert \hat V_{t}^{-1/4}m_{t}\rVert ^{2} \\&-\,2\alpha _{t}\langle \beta _{1,t}m_{t-1} + (1-\beta _{1,t})g_{t}, x_{t}- x^{*}\rangle.\end{align*}
This yields \begin{align*} \langle g_{t}, x_{t}- x^{*}\rangle\le&\frac {1}{2\alpha _{t}(1-\beta _{1,t})}\Big [ \lVert \hat V_{t}^{1/4}(x_{t} - x^{*}) \rVert ^{2} -\lVert \hat V_{t}^{1/4}(x_{t+1} - x^{*}) \rVert ^{2} \Big] \\&+\,\frac {\alpha _{t}}{2(1-\beta _{1,t})}\lVert \hat V_{t}^{-1/4}m_{t}\rVert ^{2} -\,\frac {\beta _{1,t}}{1-\beta _{1,t}}\langle m_{t-1}, x_{t}- x^{*}\rangle.\end{align*}
Therefore, we obtain \begin{align*}&\hspace{-1.2pc}\rlap{\text{$\displaystyle \sum _{i=1}^{d} g_{t,i} (x_{t,i} - {x^{*}_{i}}) $}}\qquad \\[-2.5pt]\le&\sum _{i=1}^{d} \frac {\sqrt {\hat v_{t,i}}}{2\alpha _{t}(1-\beta _{1,t}) } \Big ((x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2}\Big) \\[-2.5pt]&+\,\sum _{i=1}^{d} \frac {\alpha _{t}}{2(1-\beta _{1,t}) } \frac { m^{2}_{t,i}}{\sqrt {\hat v_{t,i}}} - \sum _{i=1}^{d}\frac {\beta _{1,t}}{1 - \beta _{1,t}}m_{t-1,i}(x_{t,i} - {x^{*}_{i}}). \\[-2.5pt]\tag{2}\end{align*}
Moreover, by Lemma 2.2, we have f_{t}(x^{*}) -f_{t}(x_{t}) \ge g_{t}^{\text T} (x^{*}-x_{t}) , where g_{t}^{\sf T} denotes the transpose of vector g_{t} . This means that \begin{equation*} f_{t}(x_{t}) - f_{t}(x^{*}) \le g_{t}^{\sf T}(x_{t}-x^{*}) = \sum _{i=1}^{d} g_{t,i}(x_{t,i}-{x^{*}_{i}}).\end{equation*}
Hence, \begin{align*} R(T)=&\sum _{t=1}^{T} [f_{t}(x_{t}) - f_{t}(x^{*})] \\[-2.5pt]\le&\sum _{t=1}^{T}g_{t}^{\sf T} (x_{t} - x^{*}) = \sum _{t=1}^{T}\sum _{i=1}^{d}g_{t,i} (x_{t,i} - {x^{*}_{i}}).\tag{3}\end{align*}
Combining (2) with (3), we obtain \begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{2\alpha _{t}(1-\beta _{1,t}) } ((x_{t,i}\! -\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2}) \\[-2.5pt]&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t}) } \frac { m^{2}_{t,i}}{\sqrt {\hat v_{t,i}}} \\[-2.5pt]&+\,\sum _{i=1}^{d} \sum _{t=2}^{T} \frac {\beta _{1,t}}{1-\beta _{1,t}}m_{t-1,i}({x^{*}_{i}} - x_{t,i}).\end{align*}
where the last term is from the setting that m_{0} = 0 . On the other hand, for all t\ge 2 , we have \begin{align*} m_{t-1,i}({x^{*}_{i}} - x_{t,i})=&\frac {(\hat {v}_{t-1,i})^{1/4}}{\sqrt {\alpha _{t-1}}} ({x^{*}_{i}} - x_{t,i}) \sqrt {\alpha _{t-1}} \frac {m_{t-1,i}}{(\hat {v}_{t-1,i})^{1/4}} \\ \le&\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}} (x_{t,i} - {x^{*}_{i}})^{2} + {\alpha _{t-1}} \frac {m^{2}_{t-1,i}}{2\sqrt {\hat {v}_{t-1,i}}}.\end{align*}
Therefore, we obtain \begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1,t}) }\left ({(x_{t,i} \!-\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2} }\right) \\&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1-\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t})} (x_{t,i} - {x^{*}_{i}})^{2}.\tag{4}\end{align*}
Moreover, we have \begin{align*} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1\!-\!\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}}=&\sum _{i=1}^{d} \sum _{t=1}^{T-1}\frac {\beta _{1,t+1}\alpha _{t}}{2(1\!-\!\beta _{1,t+1})} \frac {m^{2}_{t,i}}{\sqrt {\hat {v}_{t,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1\!-\!\beta _{1,t+1})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}},\end{align*}
where the last inequality is from the assumption that \beta _{1,t} \le \beta _{1} < 1 (1\le t\le T) . Therefore, \begin{align*}&\hspace{-1.8pc}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} + \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1-\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}.\tag{5}\end{align*}
Hence, from (4) and (5) we have \begin{align*} R(T) \le & \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1,t}) }\left ({(x_{t,i} \!-\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2} }\right)\\ \quad &+ \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}\\ \quad & + \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2},\end{align*}
where the last term is from the property that \beta _{1,t} \le \beta _{1} (1\le t \le T) .

Issue in the convergence proof of AMSGrad. We denote the terms on the right-hand side of the upper bound for R(T) in Lemma 3.1 as \begin{align*} &\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1,t}) }\left ({(x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2} }\right),\tag{6}\\& \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}},\tag{7}\end{align*}

and \begin{equation*} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2}.\tag{8}\end{equation*}

The issue in the proof of the convergence theorem of AMSGrad [3, Theorem 4] arises when examining the term (6). Indeed, in [3, page 18], Reddi et al. used the property that \beta _{1,t} \le \beta _{1} , and hence \begin{equation*} \frac {1}{1-\beta _{1,t}} \leq \frac {1}{1-\beta _{1}},\end{equation*}

to replace all \beta _{1,t} by \beta _{1} as \begin{align*} (6)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1}) }\left ({(x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2} }\right) \\ \le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\frac {1}{2(1-\beta _{1})}\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} - {x^{*}_{i}})^{2} \left ({\frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}} - \frac {\sqrt {\hat {v}_{t-1,i}}}{\alpha _{t-1}} }\right).\end{align*}
However, the first inequality above is not guaranteed because the quantity \begin{equation*}(x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2}\end{equation*}
in (6) may be both negative and positive as shown in Counter-example 3.2. This is also a neglected issue in the convergence proofs in Kingma and Ba [2, Theorem 10.5], Luo et al. [5, Theorem 4], Bock et al. [6, Theorem 4.4], and Chen and Gu [7, Theorem 4.2].

Counter-example 3.2 (For the AMSGrad Convergence Proof): We use the function from the synthetic experiment of Reddi et al. [3, Page 6] \begin{equation*} f_{t}(x)=\begin{cases}{1010 x,} & t~{{~\text {mod }} 101=1} \\ {-10 x,} &~{{~\text {otherwise }}}\end{cases}\end{equation*}

with the constraint set \mathcal F = [-1,1] . The optimal solution is x^{*}=-1 . By the proof of [3, Theorem 1], the initial point x_{1} = 1 . By Algorithm 1, m_{0} = 0 , v_{0} = 0 , and \hat v_{0} = 0 . We choose \beta _{1} = 0.9 , \beta _{1,t} = \beta _{1}\lambda ^{t-1} , where \lambda = 0.001 , \beta _{2} = 0.999 , and \alpha _{t} = \alpha /\sqrt {t} , where \alpha = 0.001 . Under this setting, we have f_{1}(x_{1}) = 1010x_{1} , f_{2}(x_{2}) = -10x_{2} , f_{3}(x_{3}) = -10x_{3} and hence \begin{align*} g_{1}=&\nabla f_{1}(x_{1}) = 1010,\\ m_{1}=&\beta _{1,1}m_{0} + (1-\beta _{1,1})g_{1} = (1-0.9)1010 = 101,\\ v_{1}=&\beta _{2}v_{0} + (1-\beta _{2})g_{1}^{2} = (1-0.999)1010^{2} = 1020.1,\\ \hat v_{1}=&\max (\hat v_{0}, v_{1}) = v_{1}.\end{align*}
Therefore, \begin{align*} x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}=&1-(0.001)101/\sqrt {1020.1}\\=&0.9968377223398316.\end{align*}
Since x_{1} - \alpha _{1}\,\,m_{1}/\sqrt {\hat v_{1}}>0 , we have \begin{align*} x_{2}=&\prod _{\mathcal F}(x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}) \\=&\min (1, x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}) \\=&0.9968377223398316.\end{align*}
Hence, \begin{equation*} (x_{1} - x^{*})^{2} - (x_{2} - x^{*})^{2} = 4 - (1.9968377223398316)^{2} \approx 0.0126391 >0.\end{equation*}
At t=2 , we have \begin{align*} g_{2}=&-10,\\ m_{2}=&\beta _{1,2}m_{1} + (1-\beta _{1,2})g_{2} \\=&(0.9)(0.001)(101) + [1-(0.9)(0.001)](-10) \\=&-9.9001,\\ v_{2}=&\beta _{2}v_{1} + (1-\beta _{2})g_{2}^{2}\\=&(0.999)(1020.1) + (1-0.999)(-10)^{2} \\=&1019.1799000000001,\\ \hat v_{2}=&\max (\hat v_{1}, v_{2}) = v_{1}\\=&1020.1.\end{align*}
Therefore, \begin{align*} x_{2} \!-\! \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}=&0.9968377223398316 \!-\! \frac {0.001}{\sqrt {2}} \frac {(-9.9001)}{\sqrt {1020.1}} \\=&0.9970569034941291.\end{align*}
Since x_{2} - \alpha _{2}\,\,m_{2}/\sqrt {\hat v_{2}}>0 , we obtain \begin{align*} x_{3}=&\prod _{\mathcal F}(x_{2} - \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}) \\=&\min (1, x_{2} - \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}) \\=&0.9970569034941291.\end{align*}
Hence, \begin{equation*} (x_{2} - x^{*})^{2} - (x_{3} - x^{*})^{2} = -0.0008753864342319062 < 0.\end{equation*}
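The sign change above can be checked numerically. The short script below is our own reproduction of the first few AMSGrad steps of this counter-example (the hyper-parameter values are exactly those listed above); the printed values may differ from the exact ones in the last digits because of floating-point arithmetic.

import numpy as np

beta1, beta2, lam, alpha = 0.9, 0.999, 0.001, 0.001
x_star = -1.0                      # constrained optimum of the counter-example
grads = [1010.0, -10.0, -10.0]     # gradients of f_1, f_2, f_3 (independent of x)

x, m, v, v_hat = 1.0, 0.0, 0.0, 0.0
for t, g in enumerate(grads, start=1):
    beta1_t = beta1 * lam ** (t - 1)
    m = beta1_t * m + (1 - beta1_t) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = max(v_hat, v)
    x_new = float(np.clip(x - (alpha / np.sqrt(t)) * m / np.sqrt(v_hat), -1.0, 1.0))
    print(f"t={t}: (x_t - x*)^2 - (x_(t+1) - x*)^2 = {(x - x_star)**2 - (x_new - x_star)**2:+.10f}")
    x = x_new

The first printed difference is positive while the second is negative, so the bracketed quantity in (6) indeed changes sign.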

Outline of our solution. Let us rewrite (6) as \begin{align*} (6)=&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac { \sqrt {\hat {v}_{t,i}}}{2\alpha _{t}(1-\beta _{1,t})} (x_{t, i} - {x^{*}_{i}})^{2} \\&- \, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t-1})} (x_{t, i} - {x^{*}_{i}})^{2} \\&- \, \sum _{i=1}^{d}\frac {\sqrt {\hat {v}_{T,i}}}{2\alpha _{T}(1-\beta _{1,T})} (x_{T+1, i} - {x^{*}_{i}})^{2}\\\le&{\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2}} \\[-1pt]&+ \, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac { \sqrt {\hat {v}_{t,i}}}{2\alpha _{t}(1-\beta _{1,t})} (x_{t, i} - {x^{*}_{i}})^{2} \\[-1pt]&-\, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t-1})} (x_{t, i} - {x^{*}_{i}})^{2}.\end{align*}


Therefore, \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\[-1pt]&+ \,\frac {1}{2}\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \Biggl (\frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}(1-\boxed {\beta _{1,t}})} \\[-1pt]&\qquad -\, \frac {\sqrt {\hat {v}_{t-1,i}}}{\alpha _{t-1}(1\!-\!\boxed {\beta _{1,t-1}})} \Biggr), \tag{9}\end{align*}

in which the differences with Reddi et al. [3] are highlighted in the boxes, namely, \boxed {\beta _{1,t}} and \boxed {\beta _{1,t-1}} instead of \beta _{1} .

We suggest two ways to overcome these differences depending on the setting of \beta _{1,t} (1\le t \le T) :

  • In Section IV: If either \beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}\lambda ^{t-1} or \beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}/t (1\le t \le T) , where 0\le \beta _{1} < 1 and 0 < \lambda < 1 , then we give a new convergence theorem for AMSGrad.

  • In Section V: If the setting for \beta _{1,t} (1\le t \le T) is general, as in the statement of Theorem A, then we suggest a new (slightly modified) version of AMSGrad.

SECTION IV.

New Convergence Theorem for AMSGrad

When either \beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}\lambda ^{t-1} or \beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}/t (1\le t \le T) , where 0\le \beta _{1} < 1 and 0 < \lambda < 1 , Theorem A can be fixed as follows, in which the upper bounds of the regret R(T) are changed.

Theorem 4.1 (Fixes for Theorem A):

Let x_{t} and v_{t} be the sequences obtained from Algorithm 1, \alpha _{t} = \frac {\alpha }{\sqrt {t}} , either \beta _{1,t} = \beta _{1}\lambda ^{t-1} , where \lambda \in (0,1) , or \beta _{1,t} = \frac {\beta _{1}}{t} for all t\in [T] , and \gamma = \frac {\beta _{1}}{\sqrt {\beta _{2}}} \le 1 . Assume that \mathcal F has bounded diameter D_{\infty } and \lVert {\nabla f_{t}(x)}\rVert _{\infty } \le G_{\infty } for all t\in [T] and x\in \mathcal F . Then there is some 1\le t_{0} \le T such that AMSGrad (Algorithm 1) achieves the following guarantee for all T\ge 1 :\begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\&+\, \frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}} \\&+ \,\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}

provided \beta _{1,t} = \beta _{1}\lambda ^{t-1} , and \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\&+\, \frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})} \\&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
provided \beta _{1,t} = \frac {\beta _{1}}{t} .

To prove Theorem 4.1, we need the following Lemmas 4.2, 4.3, and 4.4.

Lemma 4.2:

For all t\ge 1 , \sqrt {\hat {v}_{t}} \le G_{\infty } .

Proof:

From the definition of \hat {v}_{t} in AMSGrad’s algorithm, it follows that \hat {v}_{t} = \max \{v_{1},\ldots ,v_{t}\} . Therefore, there is some 1\le s\le t such that \hat {v}_{t} = v_{s} . Hence, \begin{align*} \sqrt {\hat {v}_{t}}=&\sqrt {v_{s}}\\=&\sqrt {1-\beta _{2}}\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}g^{2}_{k}}\\ \le&\sqrt {1-\beta _{2}}\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k} (\max _{1\le j\le s}{|g_{j}|})^{2}}\\ \le&G_{\infty }\sqrt {1-\beta _{2}} \sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}}\\ \le&G_{\infty }\sqrt {1-\beta _{2}} \frac {1}{\sqrt {1-\beta _{2}}}\\=&G_{\infty },\end{align*}

where the first inequality is by the fact that |g_{k}|\le \max _{1\le j\le s}{|g_{j}|} for all k\in [s] , the second inequality is from the assumption that \lVert {g_{j}}\rVert _{\infty } \le G_{\infty } , and the last inequality is by Lemma 2.4.

Lemma 4.3:

If either \beta _{1,t} = \beta _{1}\lambda ^{t-1} or \beta _{1,t} = \beta _{1}/t , then there exists some t_{0} such that for every t > t_{0} , \begin{equation*}\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \ge \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}.\end{equation*}


Proof:

Since \hat {v}_{t,i}\ge \hat {v}_{t-1,i} owing to \hat v_{t} = \max (\hat v_{t-1}, v_{t}) in Algorithm 1, it is sufficient to prove that there exists some t_{0} such that for every t > t_{0} , \begin{equation*} \frac {\sqrt {t}}{1-\beta _{1,t}} \ge \frac {\sqrt {t-1}}{1-\beta _{1,t-1}}.\end{equation*}

In other words, \begin{equation*} 1- \frac {\beta _{1,t-1}-\beta _{1,t}}{1-\beta _{1,t}} \ge \sqrt {1-\frac {1}{t}}.\tag{10}\end{equation*}

When \beta _{1,t} = \beta _{1}/t , (10) becomes \begin{equation*} 1- \frac {\beta _{1}}{(t-1)(t-\beta _{1})} \ge \sqrt {1-\frac {1}{t}}.\tag{11}\end{equation*}


When \beta _{1,t} = \beta _{1}\lambda ^{t-1} , (10) takes the form \begin{equation*} 1- \frac {(1- \lambda) \beta _{1}\lambda ^{t-2}}{1-\beta _{1}\lambda ^{t-1}} = \frac {1-\beta _{1} \lambda ^{t-2}}{1-\beta _{1} \lambda ^{t-1}} \ge \sqrt {1 - \frac {1}{t}}.\tag{12}\end{equation*}

Since \beta _{1} and \lambda are smaller than 1, when t is sufficiently large, meaning that t > t_{0} for some t_{0} , the left-hand side of (11) is 1 - O(1/t^{2}) and the left-hand side of (12) is at least 1 - \beta _{1} \lambda ^{t-2} = 1 - O(\lambda ^{t-2}) , while the right-hand side satisfies \sqrt {1-\frac {1}{t}} \le 1 - \frac {1}{2t} . Therefore, (11) and (12) hold when t is sufficiently large.
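As a numerical sanity check of Lemma 4.3, the following sketch (ours, with illustrative values \beta _{1}=0.9 and \lambda =0.5 that are not taken from the paper) scans a finite horizon and reports the last t at which inequality (10) fails, i.e., an empirical t_{0} for both choices of \beta _{1,t} .

import math

def last_failure_of_10(beta1_fn, t_max=10_000):
    """Return the largest t <= t_max at which (10) fails (1 if it never fails).

    beta1_fn(t) must return beta_{1,t}; scanning a finite horizon is only a
    numerical check, not a proof.
    """
    t0 = 1
    for t in range(2, t_max + 1):
        b_t, b_prev = beta1_fn(t), beta1_fn(t - 1)
        lhs = 1.0 - (b_prev - b_t) / (1.0 - b_t)
        rhs = math.sqrt(1.0 - 1.0 / t)
        if lhs < rhs:      # (10) fails here, so t0 must be at least t
            t0 = t
    return t0

beta1, lam = 0.9, 0.5
print(last_failure_of_10(lambda t: beta1 * lam ** (t - 1)))  # beta_{1,t} = beta1*lam^(t-1)
print(last_failure_of_10(lambda t: beta1 / t))               # beta_{1,t} = beta1/t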

Lemma 4.4:

For the parameter settings and conditions assumed in Theorem 4.1, we have \begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T +1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}


Proof:

The proof is almost identical to that of [3, Lemma 2]. Owing to \hat v_{t} = \max (\hat v_{t-1}, v_{t}) in Algorithm 1, we have \hat {v}_{t,i} \ge v_{t,i} for all t\ge 1 . Therefore \begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {m^{2}_{t,i}}{\sqrt {t {v}_{t,i}}} \\[-1.6pt]=&\frac {\left[{\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right]^{2}}{\sqrt {(1-\beta _{2})t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}. \tag{13}\end{align*}


Moreover, by Lemma 2.3 we have \begin{align*}&\hspace{-1.5pc}\left({\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right)^{2} \\[-1.6pt]\le&\left ({\sum _{k=1}^{t}(1-\beta _{1,k})^{2}\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)}\right)\left ({\sum _{k=1}^{t}\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}^{2}}\right).\end{align*}


And hence, \begin{align*}&\hspace{-2.8pc}\left({\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right)^{2} \\[-1.6pt]&\qquad \quad \le \, \left ({\sum _{k=1}^{t}\beta _{1}^{t-k}}\right)\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}\right),\tag{14}\end{align*}

since \beta _{1,k} \le 1 and \beta _{1,k} \le \beta _{1} for all 1\le k\le T . Combining (13) and (14) we obtain \begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}}\right)\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}\right)}{\sqrt {(1-\beta _{2})t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}\\ \le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}} \frac {\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}},\end{align*}
where the last inequality is obtained by applying Lemma 2.4 to \sum _{k=1}^{t}\beta _{1}^{t-k} . Therefore, \begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}\sqrt {t}} \frac {\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}\\[-1.6pt]\le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\frac {\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {\beta _{2}^{t-k}g^{2}_{k,i}}}\\[-1.6pt]\le&\frac {1}{(1-\beta _{1}) \sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\frac {\beta _{1}^{t-k}}{\sqrt {\beta _{2}^{t-k}}} |{g_{k,i}}|\\[-1.6pt]=&\frac {1}{(1-\beta _{1}) \sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\gamma ^{t-k} |{g_{k,i}}|,\end{align*}
where the second inequality is by Lemma 2.7. Therefore \begin{equation*} \sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {1}{(1\!-\!\beta _{1})\sqrt {1\!-\!\beta _{2}}} \sum _{t=1}^{T}\frac {1}{\sqrt {t}}\sum _{k=1}^{t}{\gamma }^{t-k} |{g_{k,i}}|. \tag{15}\end{equation*}
It is sufficient to consider \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|{g_{k, i}}| . Firstly, \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|{g_{k, i}}| can be expanded as \begin{align*}&\gamma ^{0}|g_{1, i}| \\&+\,\frac {1}{\sqrt {2}} \biggl (\gamma ^{1}|g_{1, i}| + \gamma ^{0}|{g_{2, i}}| \biggr)\\&+\, \frac {1}{\sqrt {3}} \biggl (\gamma ^{2}|g_{1, i}| + \gamma ^{1}|{g_{2, i}}| + \gamma ^{0}|{g_{3, i}}|\biggr)\\&+\,\cdots \\&+\,\frac {1}{\sqrt {T}} \biggl (\gamma ^{T-1}|g_{1, i}| + \gamma ^{T-2}|{g_{2, i}}| +\ldots + \gamma ^{0}|{g_{T, i}}|\biggr).\end{align*}
Regrouping with |g_{k, i}| as the common factors, we see that \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}| is equal to \begin{align*}&|g_{1, i}| \left({\gamma ^{0} + \frac {1}{\sqrt {2}}\gamma ^{1} + \frac {1}{\sqrt {3}}\gamma ^{2} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-1}}\right) \\&+\, |{g_{2, i}}| \left({\frac {1}{\sqrt {2}}\gamma ^{0} + \frac {1}{\sqrt {3}}\gamma ^{1} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-2}}\right)\\&+\,|{g_{3, i}}| \left({\frac {1}{\sqrt {3}}\gamma ^{0} + \frac {1}{\sqrt {4}}\gamma ^{1} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-3}}\right)\\&+\, \cdots \\&+\,|{g_{T, i}}| \frac {1}{\sqrt {T}}\gamma ^{0}.\end{align*}
In other words, \begin{equation*}\sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}| = \sum _{t=1}^{T} |{g_{t, i}}| \sum _{k=t}^{T}\frac {1}{\sqrt {k}}\gamma ^{k-t}\end{equation*}
Moreover, since \sum _{k=t}^{T}\frac {1}{\sqrt {k}}\gamma ^{k-t} \le \sum _{k=t}^{T}\frac {1}{\sqrt {t}}\gamma ^{k-t} = \frac {1}{\sqrt {t}}\sum _{k=t}^{T}\gamma ^{k-t} = \frac {1}{\sqrt {t}}\sum _{k=0}^{T-t}\gamma ^{k} \le \frac {1}{\sqrt {t}}\left ({\frac {1}{1-\gamma }}\right) , where the last inequality is by Lemma 2.4, we obtain \begin{align*} \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}|\le&\sum _{t=1}^{T}|g_{t, i}| \frac {1}{\sqrt {t}}\left ({\frac {1}{1-\gamma }}\right) \\=&\frac {1}{1-\gamma } \sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|.\end{align*}
Furthermore, since \begin{align*} \sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|=&\sqrt {\left ({\sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|}\right)^{2}}\\\le&\sqrt {\sum _{t=1}^{T} \frac {1}{t}} \sqrt {\sum _{t=1}^{T}g_{t, i}^{2}} \\\le&(\sqrt {\ln T +1})\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
where the first inequality is by Lemma 2.3 and the last inequality is by Lemma 2.5, we obtain \begin{equation*}\sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}| \le \frac {\sqrt {\ln T +1}}{1-\gamma } \lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}
Hence, by (15), \begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T + 1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2},\end{equation*}
which ends the proof.

Let us now prove Theorem 4.1.

Proof of Theorem 4.1:

To prove Theorem 4.1, by Lemma 3.1, we need to bound the terms (6), (7), and (8). First, we consider (7). We have \begin{align*} \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}=&\frac {\alpha }{1-\beta _{1}}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac { m_{t,i}^{2}}{\sqrt {t\hat v_{t,i}}} \\[4.5pt]\le&\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \\[4.5pt]&\times \, \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}, \tag{16}\end{align*}

where the equality is by the assumption that \alpha _{t} = \alpha /\sqrt {t} and the last inequality is by Lemma 4.4. Next, we consider (8). The bound for (8) depends on either \beta _{1,t} = \beta _{1}\lambda ^{t-1} (0 < \lambda < 1) or \beta _{1,t} = \frac {\beta _{1}}{t} . Recall that by assumption, \lVert {x_{m}-x_{n}}\rVert _{\infty } \le D_{\infty } for any m,n\in \{1,2,\ldots ,T\} , \alpha _{t} = \alpha /\sqrt {t} . If \beta _{1,t} = \beta _{1}\lambda ^{t-1} (0 < \lambda < 1) , then \begin{align*}&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2} \\=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1}\lambda ^{t-1}\sqrt {(t-1)}\sqrt {\hat {v}_{t-1,i}} }{2\alpha (1-\beta _{1})}(x_{t,i} - {x^{*}_{i}})^{2} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\sqrt {(t-1)} \lambda ^{t-1} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}t \lambda ^{t-1} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \frac {1}{(1-\lambda)^{2}} \\=&\frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}}, \tag{17}\end{align*}
where the first inequality is from Lemma 4.2 and the assumption that \beta _{1}\le 1 , and the last inequality is by Lemma 2.4. If \beta _{1,t} = \frac {\beta _{1}}{t} , then \begin{align*}&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2} \\=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1}\sqrt {(t-1)}\sqrt {\hat {v}_{t-1,i}} }{2\alpha (1-\beta _{1})t}(x_{t,i} - {x^{*}_{i}})^{2} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {(t-1)}}{t} \\ \le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {1}{\sqrt {t}} \\ \le&\frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})}, \tag{18}\end{align*}
where the first inequality is from Lemma 4.2 and the assumption that \beta _{1}\le 1 , and the last inequality is by Lemma 2.6.

Finally, we will bound (6). From (9) and replacing \alpha _{t} with \frac {\alpha }{\sqrt {t}} (1\le t \le T) , we obtain \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha (1-\beta _{1})} (x_{1, i} \!-\! {x^{*}_{i}})^{2} \\&+\,\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1\!-\!\beta _{1,t}} \!-\! \frac {\sqrt {(t\!-\!1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right).\end{align*}


By Lemma 4.3, there is some t_{0} (1\le t_{0} \le T) such that \frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \ge \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} for all t>t_{0} . Therefore, \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\[-1.5pt]&+\,\!\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}}\! -\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\[-1.5pt]&+\,\!\frac {1}{2\alpha }\!\!\sum _{i=1}^{d}\!\! \sum _{t=t_{0}+1}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}}\right)\\[-1.5pt]\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} \\[-1.5pt]&+\,\!\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\[-1.5pt]&+\,\!\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=t_{0}+1}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right).\end{align*}


Since \begin{align*}&\hspace{-2.8pc}\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=t_{0}+1}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\=&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} - \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {t_{0}\hat {v}_{t_{0},i}}}{1-\beta _{1, t_{0}}}\\\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}},\end{align*}

we have \begin{align*} (6)\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} + \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d}\frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} \\&+\, \frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} - {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right) \\ \le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} + \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d}\frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} \\&+\, \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} \frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \\ \le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t}+ \sqrt {T}}\right),\tag{19}\end{align*}
where the second inequality is obtained by omitting the term \frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} - {x^{*}_{i}})^{2} \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} , and the last inequality is by Lemma 4.2 and the assumption that \beta _{1,t}\le \beta _{1} (1\le t \le T) . Summing up, if \beta _{1,t} = \beta _{1}\lambda ^{t-1} , then, from (16), (17), and (19), we obtain \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\[-1.5pt]&+\, \frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}} \\[-1.5pt]&+\,\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}

If \beta _{1,t} = \frac {\beta _{1}}{t} , then, from (16), (18), and (19), we obtain \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\[-1.5pt]&+\, \frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})} \\[-1.5pt]&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}

which ends the proof.

The following corollary shows that, when either \beta _{1,t} = \beta _{1}\lambda ^{t-1} or \beta _{1,t}= \beta _{1}/t (1\le t \le T) , where 0\le \beta _{1} < 1 and 0 < \lambda < 1 , the average regret R(T)/T of AMSGrad converges to zero.

Corollary 4.5:

With the same assumption as in Theorem 4.1, AMSGrad achieves the following guarantee:\begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T} = 0.\end{equation*}


Proof:

The result is obtained by using Theorem 4.1 and the following fact:\begin{align*} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}=&\sum _{i=1}^{d}\sqrt {g_{1,i}^{2} + g_{2,i}^{2}+\cdots + g_{T,i}^{2}}\\ \le&\sum _{i=1}^{d}\sqrt {TG_{\infty }^{2}}\\=&dG_{\infty }\sqrt {T},\end{align*}

where the inequality is from the assumption that \lVert {g_{t}}\rVert _{\infty } \le G_{\infty } for all t\in [T] . Consequently, every term in the upper bound of Theorem 4.1 is O(\sqrt {T\ln T}) , and hence R(T)/T \to 0 as T\to \infty .

SECTION V.

New Version of AMSGrad Optimizer: AdamX

Let f_{1}, f_{2},\ldots , f_{T}: \mathcal F \to \mathbb R be an arbitrary sequence of convex cost functions. If the sequence \{\beta _{1,t}\}_{1\le t \le T} is kept arbitrary, as in the setting of Theorem A, then to ensure that the regret R(T) satisfies R(T)/T\to 0 , we suggest a new algorithm as follows.

With this Algorithm 2, the regret is bounded as follows.

Algorithm 2

AdamX: A New Variant of Adam and AMSGrad

Input:

x_{1}\in \mathbb R^{d} , step size \{\alpha _{t}\}_{t=1}^{T}, \{\beta _{1,t}\}_{t=1}^{T}, \beta _{2}

Set m_{0} = 0, v_{0} = 0 , and \hat v_{0} = 0

for (t=1; t\le T; t\gets t+1) do

g_{t} = \nabla f_{t}(x_{t})

m_{t} = \beta _{1,t}\cdot m_{t-1} + (1-\beta _{1,t})\cdot g_{t}

v_{t} = \beta _{2}\cdot v_{t-1} + (1-\beta _{2})\cdot g^{2}_{t}

\begin{equation*} \hat v_{t} = \begin{cases} v_{t} & \text {if } t = 1\\ \max \left\{{\dfrac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1}, v_{t}}\right\} & \text {if } t\ge 2 \end{cases}\end{equation*}


x_{t+1} = \prod _{\mathcal F, \sqrt {\hat V_{t}}}(x_{t} - \alpha _{t} \cdot m_{t}/\sqrt {\hat v_{t}}) ,

where \hat V_{t} = \text {diag}(\hat v_{t})

end for

Output:

x_{T+1}
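As with Algorithm 1, the following NumPy sketch is our own illustration of Algorithm 2 (names such as adamx and grad_fns are ours, and eps is an added stabilizing constant); the only difference from the AMSGrad sketch after Algorithm 1 is the rescaling factor ((1-\beta _{1,t})/(1-\beta _{1,t-1}))^{2} applied to \hat v_{t-1} inside the max.

import numpy as np

def adamx(grad_fns, x1, alpha=0.001, beta1=0.9, beta2=0.999,
          lam=0.001, lo=-1.0, hi=1.0, eps=1e-8):
    """Minimal AdamX sketch (Algorithm 2) on the box [lo, hi]^d,
    with beta_{1,t} = beta1 * lam**(t-1)."""
    x = np.asarray(x1, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    v_hat = np.zeros_like(x)
    beta1_prev = beta1
    for t, grad in enumerate(grad_fns, start=1):
        g = grad(x)
        beta1_t = beta1 * lam ** (t - 1)
        m = beta1_t * m + (1.0 - beta1_t) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        if t == 1:
            v_hat = v.copy()
        else:
            scale = ((1.0 - beta1_t) / (1.0 - beta1_prev)) ** 2
            v_hat = np.maximum(scale * v_hat, v)     # AdamX update of v_hat
        x = np.clip(x - (alpha / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps), lo, hi)
        beta1_prev = beta1_t
    return x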

Theorem 5.1:

Let x_{t} and v_{t} be the sequences obtained from Algorithm 2, \alpha _{t} = \frac {\alpha }{\sqrt {t}} , \beta _{1} = \beta _{1,1} , \beta _{1,t} \le \beta _{1} for all t\in [T] and \gamma = \frac {\beta _{1}}{\sqrt {\beta _{2}}} \le 1 . Assume that \mathcal F has bounded diameter D_{\infty } and \lVert {\nabla f_{t}(x)}\rVert _{\infty } \le G_{\infty } for all t\in [T] and x\in \mathcal F . For x_{t} generated using AdamX (Algorithm 2), we have the following bound on the regret:\begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}}\sqrt {T} \\&+\, \frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)} \\&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}


To prove Theorem 5.1, we need the following Lemmas 5.2, 5.3, and 5.4.

Lemma 5.2:

For all t\ge 1 , we have \begin{equation*} \hat {v}_{t} = \max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le t},\tag{20}\end{equation*}

where \hat {v}_{t} is in Algorithm 2.

Proof:

We will prove (20) by induction on t . Recall that by the update rule on \hat {v}_{t} , we have \hat v_{1} \mathop{=}\limits^{\Delta } v_{1} and \hat v_{t} \mathop{=}\limits^{\Delta } \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1}, v_{t}}\right\} if t\ge 2 . Therefore, \begin{align*} \hat v_{2}\overset{\Delta }{=}&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,1})^{2}}\hat v_{1}, v_{2}}\right \}\\=&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,1})^{2}} v_{1}, v_{2}}\right \}\\=&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le 2}.\end{align*}

Assume that (20) holds up to t-1 , in particular \begin{equation*}\hat {v}_{t-1} = \max \left \{{\frac {(1-\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le t-1}.\end{equation*}
Since \begin{equation*}\hat {v}_{t}\overset {\Delta }{=} \max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat {v}_{t-1}, v_{t}}\right \},\end{equation*}
we have \begin{align*} \hat {v}_{t}=&\max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\left ({\max \left\{{\frac {(1-\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}{v}_{s}}\right\}_{1\le s\le t-1}}\right), v_{t}}\right \} \\=&\max \left\{{ \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\frac {(1-\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}{v}_{s}}\right\}_{1\le s\le t-1}, v_{t}}\right\} \\=&\max \left\{{ \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}} {v}_{s}}\right\}_{1\le s\le t-1}, \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t})^{2}}v_{t}}\right\} \\=&\max \left\{{ \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}} {v}_{s}}\right\}_{1\le s\le t},\end{align*}
which ends the proof.

Lemma 5.3:

For all t\ge 1 , we have \sqrt {\hat {v}_{t}} \le \frac {G_{\infty }}{1-\beta _{1}} , where \hat {v}_{t} is in Algorithm 2.

Proof:

By Lemma 5.2, \begin{equation*}\hat {v}_{t} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right\}_{1\le s\le t}.\end{equation*}

Therefore, there is some s such that 1\le s\le t and \hat {v}_{t} = \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s} . Hence, \begin{align*} \sqrt {\hat {v}_{t}}=&\sqrt {\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}} \\=&\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}g^{2}_{k}}\\ \le&\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k} (\max _{1\le j\le s}{|g_{j}|})^{2}}\\ \le&G_{\infty }\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}}\\ \le&G_{\infty }\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\frac {1}{\sqrt {1-\beta _{2}}}\\=&\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)G_{\infty }\\ \le&\frac {G_{\infty }}{1-\beta _{1}},\end{align*}
which ends the proof.

Lemma 5.4:

For the parameter settings and conditions assumed in Theorem 5.1, we have \begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T +1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}


Proof:

Since, for all t\ge 1 , \begin{equation*}\hat {v}_{t,i} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s,i}}\right\}_{1\le s\le t}\end{equation*}

by Lemma 5.2, we have \hat {v}_{t,i} \ge v_{t,i} , and hence the proof is the same as that of Lemma 4.4.

Proof of Theorem 5.1:

Similarly to the proof of Theorem 4.1, we need to bound (6), (7), and (8). By using Lemma 5.4, we obtain the same bound for (7) as in the proof of Theorem 4.1, that is, \begin{align*} (7)=&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\=&\frac {\alpha }{1-\beta _{1}}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac { m_{t,i}^{2}}{\sqrt {t\hat v_{t,i}}}\\\le&\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}

where the last inequality is by Lemma 5.4. Now we bound (8). By the assumption that \lVert {x_{m}-x_{n}}\rVert _{\infty } \le D_{\infty } for any m,n\in \{1,\ldots ,T\} , \alpha _{t} = \alpha /\sqrt {t} , and \beta _{1,t} \le \beta _{1} < 1 , we obtain \begin{align*} (8)=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2}\\ \le&\frac {D_{\infty }^{2}}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)\hat {v}_{t-1,i}}.\end{align*}
Therefore, from Lemma 5.3, we obtain \begin{equation*} (8) \le \frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)}.\end{equation*}
Finally, we will bound (6). By (9) and replacing \alpha _{t} = \frac {\alpha }{\sqrt {t}} (1\le t\le T) , we obtain \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha (1-\beta _{1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\! \frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} - {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\end{align*}
Moreover, by the update rule of Algorithm 2, we have \begin{equation*}\hat v_{t,i} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1,i}, v_{t,i}}\right\}.\end{equation*}
Therefore, \hat v_{t,i} \ge \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1,i} , and hence \begin{align*}&\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} \\ \ge&\frac {\sqrt {t\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}\\=&\frac {\sqrt {t \hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} \ge 0.\end{align*}
Now, by the nonnegativity of the quantity \begin{equation*}\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}},\end{equation*}
we obtain \begin{align*} (6)\le&\frac {D_{\infty }^{2} }{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1}} \\+&\frac {D_{\infty }^{2} }{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\=&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac {\sqrt {T\hat {v}_{T,i}} }{1-\beta _{1,T}}\\\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}}\sqrt {T}~,\end{align*}
where the last inequality is by Lemma 5.3. Hence we obtain the desired upper bound for R(T) .

Corollary 5.5:

With the same assumption as in Theorem 5.1, and for all 0 \le \beta _{1,t} < 1 satisfying \begin{equation*}\lim _{T\to \infty }\frac {\sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}}{T} = 0,\end{equation*}

AdamX achieves the following guarantee:\begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T}=0.\end{equation*}

Proof:

By Theorem 5.1, it is sufficient to consider the term \begin{equation*}\frac {dD_{\infty }^{2}~G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}\end{equation*}

on the right-hand side of the upper bound for R(T) in Theorem 5.1, since the other terms are O(\sqrt {T}) and O(\sqrt {T\ln T}) and hence vanish after dividing by T . Because the factor \frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}} is bounded and does not depend on T , the statement follows from the hypothesis on \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1} .

When either \beta _{1,t} = \beta _{1}\lambda ^{t-1} for some \lambda \in (0,1) , or \beta _{1,t} = \frac {1}{t} in Theorem 5.1, we obtain the following guarantee that the average regret of AdamX converges to zero.

Corollary 5.6:

With the same assumption as in Theorem 5.1, and either \beta _{1,t} = \beta _{1}\lambda ^{t-1} for some \lambda \in (0,1) , or \beta _{1,t} = \frac {1}{t} , AdamX achieves the following guarantee:\begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T} = 0.\end{equation*}


Proof:

By Corollary 5.5, it is sufficient to consider the term \begin{equation*}\sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}.\end{equation*}

When \beta _{1,t} = \beta _{1}\lambda ^{t-1} for some \lambda \in (0,1) , we have \begin{align*} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}=&\sum _{t=2}^{T}\beta _{1}\lambda ^{t-1}\sqrt {t-1} \\ \le&\sum _{t=2}^{T}\sqrt {(t-1)} \lambda ^{t-1} \\ \le&\sum _{t=2}^{T}t \lambda ^{t-1} \\ \le&\frac {1}{(1-\lambda)^{2}},\tag{21}\end{align*}
where the first inequality is from the property that \beta _{1} \le 1 , and the last inequality is from Lemma 2.4. When \beta _{1,t} = \frac {1}{t} , we obtain \begin{align*} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}=&\sum _{t=2}^{T}\frac {\sqrt {t-1}}{t} \\\le&\sum _{t=2}^{T}\frac {1}{\sqrt {t}} \\\le&2\sqrt {T},\tag{22}\end{align*}
where the last inequality is from Lemma 2.6. Now, by combining (21) and (22) with Corollary 5.5, we obtain the desired result.

SECTION VI.

Experiments

While we consider our main contributions to be the theoretical analyses of AMSGrad and AdamX in the previous sections, we provide experimental results for AMSGrad and AdamX in this section. Concretely, we use the PyTorch implementation of AMSGrad, obtained by setting the Boolean flag amsgrad = True. The code for AdamX is based on that of AMSGrad, with the corresponding modifications as in Algorithm 2. The parameters for AMSGrad and AdamX are identical in our experiments, namely (\beta _{1}, \beta _{2})=(0.9, 0.999) , the term added to the denominator to improve numerical stability is \epsilon = 10^{-8} , and additionally we set \beta _{1,t} = \beta _{1}\lambda ^{t-1} with \lambda =0.001 to make use of Corollary 5.6 on the convergence of AdamX.
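For concreteness, a minimal PyTorch setup consistent with these hyper-parameters might look as follows; this is only a sketch on our side. The built-in torch.optim.Adam with amsgrad=True uses a constant \beta _{1} , so the decaying \beta _{1,t} and AdamX itself require a custom optimizer (for example along the lines of the NumPy sketch after Algorithm 2), and torchvision's resnet18 is used here merely as a stand-in for the CIFAR-specific ResNet18 of [8].

import torch
import torchvision.models as models

model = models.resnet18(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, amsgrad=True)

def lr_for_epoch(epoch):
    # Stepwise schedule from the text: 1e-3, 1e-4, 1e-5, 1e-6, 1e-6/2.
    if epoch <= 80:
        return 1e-3
    if epoch <= 120:
        return 1e-4
    if epoch <= 160:
        return 1e-5
    if epoch <= 180:
        return 1e-6
    return 1e-6 / 2

for epoch in range(1, 201):
    for group in optimizer.param_groups:
        group["lr"] = lr_for_epoch(epoch)
    # ... one training epoch over CIFAR-10 with batch size 128 goes here ...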

The learning rate is scheduled identically for both optimizers AMSGrad and AdamX as follows: 10^{-3} , 10^{-4} , 10^{-5} , 10^{-6} , and 10^{-6}/2 when the epoch is in the ranges [0, 80], [81, 120], [121, 160], [161, 180], and [181, 200], respectively. We use CIFAR-10 (containing 50000 training images and 10000 test images of size 32\times 32 ) as the dataset and the residual networks ResNet18 [8] and PreActResNet18 [9] for training, with batch size 128. The testing results are given in Figure 1, where one can see that AMSGrad and AdamX behave similarly, which supports our theoretical results on the convergence of both AMSGrad (Section IV) and AdamX (Section V).

FIGURE 1. Testing accuracies over CIFAR-10 using AMSGrad and AdamX, with different neural network models.

SECTION VII.

Conclusion

We have shown that the convergence proof of AMSGrad [3] is problematic, and presented various fixes for it, which include a new and slightly modified version called AdamX. Our work helps ensure the theoretical foundation of those optimizers.
