Introduction and Our Contributions
Stochastic gradient descent (SGD) [1] and its variants are among the most popular algorithms for training deep neural networks. Among these variants, Adam (adaptive moment estimation) [2] is widely used in practice. However, Reddi et al. [3] showed that the convergence proof of Adam is problematic and proposed a variant of Adam, called AMSGrad, to resolve this issue.
Our contribution. In this paper, we point out a flaw in the convergence proof of AMSGrad, recalled as Theorem A below. We then fix this flaw by providing a new convergence proof for AMSGrad in the case of special parameters. In addition, in the case of general parameters, we propose a new and slightly modified version of AMSGrad.
To provide more details, let us recall AMSGrad in Algorithm 1; the mathematical notation is defined in Section II.
AMSGrad (Reddi et al. [3])
Set
for
end for
The main theorem for the convergence of AMSGrad in [3] is as follows. To simplify the notation, we define
Theorem A [Theorem 4 in [3], problematic]:
Let \begin{align*} R(T)\le&\frac {D_{\infty }^{2}\sqrt {T}}{\alpha (1-\beta _{1})}\sum \limits _{i=1}^{d} \sqrt {\hat {v}_{T,i}} \\[-2pt]&+\,\frac {D_{\infty }^{2}}{2(1-\beta _{1})} \sum \limits _{i=1}^{d} \sum _{t=1}^{T}\frac {\beta _{1,t}\sqrt {\hat v_{t,i}}}{\alpha _{t}} \\[-2pt]&+\,\frac {\alpha \sqrt { 1+\ln T}}{(1-\beta _{1})^{2}(1-\gamma)\sqrt {1-\beta _{2}}} \sum \limits _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}
In their proof for Theorem A, Reddi et al. resolved an issue on the so-called telescopic sum in the convergence proof of Adam ([2, Theorem 10.5]). Specifically, Reddi et al. adjusted \begin{equation*} {\frac {\sqrt {\hat v_{t+1,i}}}{\alpha _{t+1}} \ge \frac {\sqrt {\hat v_{t,i}}}{\alpha _{t}} }\tag{1}\end{equation*}
\begin{equation*} {\frac {\sqrt {\hat {v}_{t+1,i}}}{\alpha _{t+1}(1-\boxed {\beta _{1, t+1}})} \ge \frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}(1-\boxed {\beta _{1, t}})}}\end{equation*}
Paper roadmap. We begin with preliminaries in Section II. We show where the proof of Theorem A becomes invalid in Section III. After that, we suggest two ways to resolve the issue in Sections IV and V.
Preliminaries
Notation. Given a sequence of vectors
Optimization setup. Let \begin{equation*} R(T) = \sum _{t=1}^{T}[f_{t}(x_{t}) -f_{t}(x^{*})],\end{equation*}
Definition 2.1:
A function \begin{equation*} \lambda f(x) + (1-\lambda)f(y) \ge f(\lambda x + (1-\lambda)y).\end{equation*}
Lemma 2.2:
If a function \begin{equation*} f(y) \ge f(x) + \nabla f(x)^{\sf T}(y-x),\end{equation*}
Lemma 2.3 (Cauchy–Schwarz inequality):
For all \begin{equation*} \left ({\sum _{i=1}^{n} u_{i}v_{i}}\right)^{2} \le \left ({\sum _{i=1}^{n}u_{i}^{2}}\right) \left ({\sum _{i=1}^{n}v_{i}^{2}}\right).\end{equation*}
Lemma 2.4 (Taylor series):
For \begin{equation*} \sum _{t\ge 0}{\alpha ^{t}} = \frac {1}{1-\alpha }\end{equation*}
\begin{equation*} \sum _{t\ge 1}{t\alpha ^{t-1}} = \frac {1}{(1-\alpha)^{2}}.\end{equation*}
Lemma 2.5 (Upper bound for the harmonic series):
For \begin{equation*} \sum _{n=1}^{N} \frac {1}{n}\le \ln N +1.\end{equation*}
Lemma 2.6:
For \begin{equation*} \sum _{n=1}^{N} \frac {1}{\sqrt {n}}\le 2\sqrt {N}.\end{equation*}
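The two bounds in Lemmas 2.5 and 2.6 are easy to sanity-check numerically. The following is a minimal sketch, assuming that N ranges over positive integers; the helper names `harmonic_sum` and `inv_sqrt_sum` are hypothetical and are not part of the paper.

```python
import math

def harmonic_sum(N):
    """Partial sum 1 + 1/2 + ... + 1/N (left-hand side of Lemma 2.5)."""
    return sum(1.0 / n for n in range(1, N + 1))

def inv_sqrt_sum(N):
    """Partial sum 1 + 1/sqrt(2) + ... + 1/sqrt(N) (left-hand side of Lemma 2.6)."""
    return sum(1.0 / math.sqrt(n) for n in range(1, N + 1))

# Compare both sides for a few sample sizes (assumed values, chosen for illustration).
for N in (1, 10, 1000, 100000):
    print(N,
          harmonic_sum(N), "<=", math.log(N) + 1.0,
          "|",
          inv_sqrt_sum(N), "<=", 2.0 * math.sqrt(N))
```

A finite check of this kind does not replace the proofs; it only guards against transcription errors in the stated bounds.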
Lemma 2.7:
For all \begin{equation*} \frac {\sum _{i=1}^{n}a_{i}}{\sum _{j=1}^{n} b_{j}} \le \sum _{i=1}^{n}\frac {a_{i}}{b_{i}}.\end{equation*}
Lemma 2.8:
[4, Lemma 3 in arXiv version] For any \begin{equation*} \lVert Q ^{1/2}(u_{1}-u_{2})\rVert \le \lVert Q ^{1/2}(z_{1}-z_{2})\rVert.\end{equation*}
Issue in the Convergence Proof of AMSGrad
Before showing the issue in the convergence proof of AMSGrad, let us recall and prove the following inequality, which also appears in [3].
Lemma 3.1:
Algorithm 1 achieves the following guarantee, for all \begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}((x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2})}{ 2\alpha _{t}(1-\beta _{1,t})} \\&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2}.\end{align*}
Proof:
We note that \begin{align*} x_{t+1}=&\prod _{\mathcal F, \sqrt {\hat V_{t}}}(x_{t} - \alpha _{t} \cdot {\hat V^{-1/2}_{t}}m_{t}) \\[-2.5pt]=&\mathrm {argmin}_{x\in \mathcal F}\lVert \hat V^{1/4}(x-(x_{t} - \alpha _{t} \hat V^{-1/2}m_{t}))\rVert\end{align*}
\begin{align*}&\hspace{-1.8pc}\rlap{\text{$\displaystyle \lVert \hat V^{1/4}(x_{t+1} - x^{*}) \rVert ^{2} $}}\qquad \\[-2.5pt]\le&\lVert \hat V^{1/4}(x_{t} - \alpha _{t} \hat V^{-1/2}m_{t} - x^{*}) \rVert ^{2} \\[-2.5pt]=&\lVert \hat V^{1/4}(x_{t} - x^{*}) \rVert ^{2} \! +\! \alpha ^{2}_{t} \lVert \hat V^{-1/4}m_{t}\rVert ^{2} \!-\! 2\alpha _{t}\langle m_{t}, x_{t}- x^{*}\rangle \\[-2.5pt]=&\lVert \hat V^{1/4}(x_{t} - x^{*}) \rVert ^{2} + \alpha ^{2}_{t} \lVert \hat V^{-1/4}m_{t}\rVert ^{2} \\[-2.5pt]&-\,2\alpha _{t}\langle \beta _{1,t}m_{t-1} + (1-\beta _{1,t})g_{t}, x_{t}- x^{*}\rangle.\end{align*}
\begin{align*} \langle g_{t}, x_{t}- x^{*}\rangle\le&\frac {1}{2\alpha _{t}(1-\beta _{1,t})}\Big [ \lVert \hat V^{1/4}(x_{t} - x^{*}) \rVert ^{2} \\[-2.5pt]&\qquad -\lVert \hat V^{1/4}(x_{t+1} - x^{*}) \rVert ^{2} \Big] \\[-2.5pt]&+\,\frac {\alpha _{t}}{2(1-\beta _{1,t})}\lVert \hat V^{-1/4}m_{t}\rVert ^{2} \\[-2.5pt]&-\,\frac {\beta _{1,t}}{1-\beta _{1,t}}\langle m_{t-1}, x_{t}- x^{*}\rangle.\end{align*}
\begin{align*}&\hspace{-1.2pc}\rlap{\text{$\displaystyle \sum _{i=1}^{d} g_{t,i} (x_{t,i} - {x^{*}_{i}}) $}}\qquad \\[-2.5pt]\le&\sum _{i=1}^{d} \frac {\sqrt {\hat v_{t,i}}}{2\alpha _{t}(1-\beta _{1,t}) } \Big ((x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2}\Big) \\[-2.5pt]&+\,\sum _{i=1}^{d} \frac {\alpha _{t}}{2(1-\beta _{1,t}) } \frac { m^{2}_{t,i}}{\sqrt {\hat v_{t,i}}} - \sum _{i=1}^{d}\frac {\beta _{1,t}}{1 - \beta _{1,t}}m_{t-1,i}(x_{t,i} - {x^{*}_{i}}). \\[-2.5pt]\tag{2}\end{align*}
\begin{equation*} f_{t}(x_{t}) - f_{t}(x^{*}) \le g_{t}^{\sf T}(x_{t}-x^{*}) = \sum _{i=1}^{d} g_{t,i}(x_{t,i}-{x^{*}_{i}}).\end{equation*}
\begin{align*} R(T)=&\sum _{t=1}^{T} [f_{t}(x_{t}) - f_{t}(x^{*})] \\[-2.5pt]\le&\sum _{t=1}^{T}g_{t}^{\sf T} (x_{t} - x^{*}) = \sum _{t=1}^{T}\sum _{i=1}^{d}g_{t,i} (x_{t,i} - {x^{*}_{i}}).\tag{3}\end{align*}
\begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{2\alpha _{t}(1-\beta _{1,t}) } ((x_{t,i}\! -\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2}) \\[-2.5pt]&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t}) } \frac { m^{2}_{t,i}}{\sqrt {\hat v_{t,i}}} \\[-2.5pt]&+\,\sum _{i=1}^{d} \sum _{t=2}^{T} \frac {\beta _{1,t}}{1-\beta _{1,t}}m_{t-1,i}({x^{*}_{i}} - x_{t,i}).\end{align*}
\begin{align*} m_{t-1,i}({x^{*}_{i}} \!-\! x_{t,i})=&\frac {(\hat {v}_{t-1,i})^{1/4}}{\sqrt {\alpha _{t-1}}} ({x^{*}_{i}} \!-\! x_{t,i}) \sqrt {\alpha _{t-1}} \frac {m_{t-1,i}}{(\hat {v}_{t-1,i})^{1/4}} \\\le&\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}} (x_{t,i} \!-\! {x^{*}_{i}})^{2} + {\alpha _{t-1}} \frac {m^{2}_{t-1,i}}{2\sqrt {\hat {v}_{t-1,i}}},\end{align*}
\begin{align*} R(T)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1,t}) }\left ({(x_{t,i} \!-\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2} }\right) \\&+\,\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1-\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t})} (x_{t,i} - {x^{*}_{i}})^{2}.\tag{4}\end{align*}
\begin{align*} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1\!-\!\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}}=&\sum _{i=1}^{d} \sum _{t=1}^{T-1}\frac {\beta _{1,t+1}\alpha _{t}}{2(1\!-\!\beta _{1,t+1})} \frac {m^{2}_{t,i}}{\sqrt {\hat {v}_{t,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1\!-\!\beta _{1,t+1})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}},\end{align*}
\begin{align*}&\hspace{-1.8pc}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{2(1-\beta _{1,t})} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} + \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\alpha _{t-1}}{2(1-\beta _{1,t})} \frac {m^{2}_{t-1,i}}{\sqrt {\hat {v}_{t-1,i}}} \\\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}.\tag{5}\end{align*}
\begin{align*} R(T) \le & \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1,t}) }\left ({(x_{t,i} \!-\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2} }\right)\\ \quad &+ \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}\\ \quad & + \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2},\end{align*}
Issue in the convergence proof of AMSGrad. We denote the terms on the right-hand side of the upper bound for \begin{align*} &\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1\!-\!\beta _{1,t}) }\left ({(x_{t,i} \!- \!{x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2} }\right),\tag{6}\\&\qquad \qquad\quad\quad \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}},\tag{7}\end{align*}
\begin{equation*} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2}.\tag{8}\end{equation*}
The issue in the proof of the convergence theorem of AMSGrad [3, Theorem 4] becomes apparent on examining the term (6). Indeed, in [3, page 18], Reddi et al. used the property that \begin{equation*} \frac {1}{1-\beta _{1,t}} \leq \frac {1}{1-\beta _{1}},\end{equation*}
\begin{align*} (6)\le&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\sqrt {\hat v_{t,i}}}{ 2\alpha _{t}(1-\beta _{1}) }\left ({(x_{t,i} \!-\! {x^{*}_{i}})^{2} \!-\! (x_{t+1, i} \!-\! {x^{*}_{i}})^{2} }\right) \\\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\frac {1}{2(1\!-\!\beta _{1})}\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}} \!-\! \frac {\sqrt {\hat {v}_{t-1,i}}}{\alpha _{t-1}} }\right).\end{align*}
\begin{equation*}(x_{t,i} - {x^{*}_{i}})^{2} - (x_{t+1, i} - {x^{*}_{i}})^{2}\end{equation*}
Counter-example III.2 (For AMSGrad Convergence Proof): We use the function in the Synthetic Experiment of Reddi et al. [3, Page 6] \begin{equation*} f_{t}(x)=\begin{cases}{1010 x,} & t~{{~\text {mod }} 101=1} \\ {-10 x,} &~{{~\text {otherwise }}}\end{cases}\end{equation*}
\begin{align*} g_{1}=&\nabla f_{1}(x_{1}) = 1010,\\ m_{1}=&\beta _{1,1}m_{0} + (1-\beta _{1,1})g_{1} = (1-0.9)1010 = 101,\\ v_{1}=&\beta _{2}v_{0} + (1-\beta _{2})g_{1}^{2} = (1-0.999)1010^{2} = 1020.1,\\ \hat v_{1}=&\max (\hat v_{0}, v_{1}) = v_{1}.\end{align*}
\begin{align*} x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}=&1-(0.001)101/\sqrt {1020.1}\\=&0.9968377223398316.\end{align*}
\begin{align*} x_{2}=&\prod _{\mathcal F}(x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}) \\=&\min (1, x_{1} - \alpha _{1}~m_{1}/\sqrt {\hat v_{1}}) \\=&0.9968377223398316.\end{align*}
\begin{equation*} (x_{1} - x^{*})^{2} - (x_{2} - x^{*})^{2} \approx 0.01263911064 >0.\end{equation*}
\begin{align*} g_{2}=&-10,\\ m_{2}=&\beta _{1,2}m_{1} + (1-\beta _{1,2})g_{2} \\=&(0.9)(0.001)(101) + [1-(0.9)(0.001)](-10) \\=&-9.9001,\\ v_{2}=&\beta _{2}v_{1} + (1-\beta _{2})g_{2}^{2}\\=&(0.999)(1020.1) + (1-0.999)(-10)^{2} \\=&1019.1799000000001,\\ \hat v_{2}=&\max (\hat v_{1}, v_{2}) = v_{1}\\=&1020.1.\end{align*}
\begin{align*} x_{2} \!-\! \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}=&0.9968377223398316 \!-\! \frac {0.001}{\sqrt {2}} \frac {(-9.9001)}{\sqrt {1020.1}} \\=&0.9970569034941291.\end{align*}
\begin{align*} x_{3}=&\prod _{\mathcal F}(x_{2} - \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}) \\=&\min (1, x_{2} - \alpha _{2}~m_{2}/\sqrt {\hat v_{2}}) \\=&0.9970569034941291.\end{align*}
\begin{equation*} (x_{2} - x^{*})^{2} - (x_{3} - x^{*})^{2} = -0.0008753864342319062 < 0.\end{equation*}
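The two steps of the counter-example can be replayed in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: the hyperparameters (alpha = 0.001, beta_1 = 0.9, lambda = 0.001, beta_2 = 0.999, alpha_t = alpha / sqrt(t), m_0 = v_0 = 0) are read off the displayed computations, while the feasible set [-1, 1] and the comparator x* = -1 are assumptions taken from the synthetic experiment in [3]. The function name `amsgrad_1d` is hypothetical. In one dimension the weighted projection reduces to clipping onto the interval.

```python
import math

def amsgrad_1d(grads, x1=1.0, alpha=0.001, beta1=0.9, lam=0.001,
               beta2=0.999, lo=-1.0, hi=1.0):
    """Run 1-D AMSGrad on a fixed gradient sequence; return [x_1, x_2, ...]."""
    x, m, v, v_hat = x1, 0.0, 0.0, 0.0
    xs = [x]
    for t, g in enumerate(grads, start=1):
        beta1_t = beta1 * lam ** (t - 1)      # beta_{1,t} = beta_1 * lambda^(t-1)
        alpha_t = alpha / math.sqrt(t)        # step size alpha_t = alpha / sqrt(t)
        m = beta1_t * m + (1.0 - beta1_t) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = max(v_hat, v)                 # AMSGrad keeps the running maximum
        # Projection onto [lo, hi]; in 1-D this is plain clipping.
        x = min(hi, max(lo, x - alpha_t * m / math.sqrt(v_hat)))
        xs.append(x)
    return xs

x_star = -1.0                                  # assumed comparator point
xs = amsgrad_1d([1010.0, -10.0])               # g_1 = 1010, g_2 = -10
diffs = [(xs[t] - x_star) ** 2 - (xs[t + 1] - x_star) ** 2 for t in range(2)]
print("iterates:", xs)
print("differences:", diffs)                   # first is positive, second is negative
```

The sign change between the two differences is the point of the counter-example: the factor multiplying each telescoping coefficient is not guaranteed to be non-negative.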
Outline of our solution. Let us rewrite (6) as \begin{align*} (6)=&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\sum _{i=1}^{d} \sum _{t=2}^{T}\frac { \sqrt {\hat {v}_{t,i}}}{2\alpha _{t}(1-\beta _{1,t})} (x_{t, i} - {x^{*}_{i}})^{2} \\&- \, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t-1})} (x_{t, i} - {x^{*}_{i}})^{2} \\&- \, \sum _{i=1}^{d}\frac {\sqrt {\hat {v}_{T,i}}}{2\alpha _{T}(1-\beta _{1,T})} (x_{T+1, i} - {x^{*}_{i}})^{2}\\\le&{\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2}} \\[-1pt]&+ \, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac { \sqrt {\hat {v}_{t,i}}}{2\alpha _{t}(1-\beta _{1,t})} (x_{t, i} - {x^{*}_{i}})^{2} \\[-1pt]&-\, \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t-1})} (x_{t, i} - {x^{*}_{i}})^{2}.\end{align*}
Therefore, \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\[-1pt]&+ \,\frac {1}{2}\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \Biggl (\frac {\sqrt {\hat {v}_{t,i}}}{\alpha _{t}(1-\boxed {\beta _{1,t}})} \\[-1pt]&\qquad -\, \frac {\sqrt {\hat {v}_{t-1,i}}}{\alpha _{t-1}(1\!-\!\boxed {\beta _{1,t-1}})} \Biggr), \tag{9}\end{align*}
We suggest two ways to overcome these differences, depending on the setting of $\beta _{1,t}$ $(1\le t \le T)$:
In Section IV: If either $\beta _{1,t} \mathop{=}\limits^{\Delta } \beta _{1}\lambda ^{t-1}$ or $\beta _{1,t} \mathop{=}\limits^{\Delta } 1/t$ $(1\le t \le T)$, where $0\le \beta _{1} < 1$ and $0 < \lambda < 1$, then we give a new convergence theorem for AMSGrad in Section IV.
In Section V: If the setting for $\beta _{1,t}$ $(1\le t \le T)$ is general, as in the statement of Theorem A, then we suggest a new (slightly modified) version for AMSGrad in Section V.
New Convergence Theorem for AMSGrad
When either
Theorem 4.1 (Fixes for Theorem A):
Let \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\[-1pt]&+\, \frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}} \\[-1pt]&+ \,\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
\begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\[-1pt]&+\, \frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})} \\[-1pt]&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
To prove Theorem 4.1, we need the following Lemmas 4.2, 4.3, and 4.4.
Lemma 4.2:
Proof:
From the definition of \begin{align*} \sqrt {\hat {v}_{t}}=&\sqrt {v_{s}}\\=&\sqrt {1-\beta _{2}}\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}g^{2}_{k}}\\\le&\sqrt {1-\beta _{2}}\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k} (\max _{1\le {j}\le s}{|g_{j}}|)^{2}}\\=&G_{\infty }\sqrt {1-\beta _{2}} \sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}}\\\le&G_{\infty }\sqrt {1-\beta _{2}} \frac {1}{\sqrt {1-\beta _{2}}}\\=&G_{\infty },\end{align*}
Lemma 4.3:
If either \begin{equation*}\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \ge \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}.\end{equation*}
Proof:
Since \begin{equation*} \frac {\sqrt {t}}{1-\beta _{1,t}} \ge \frac {\sqrt {t-1}}{1-\beta _{1,t-1}}.\end{equation*}
\begin{equation*} 1- \frac {\beta _{1,t-1}-\beta _{1,t}}{1-\beta _{1,t}} \ge \sqrt {1-\frac {1}{t}}.\tag{10}\end{equation*}
When \begin{equation*} 1- \frac {\beta _{1}}{(t-1)(t-\beta _{1})} \ge \sqrt {1-\frac {1}{t}}.\tag{11}\end{equation*}
When \begin{equation*} 1- \frac {(1- \lambda) \beta _{1}\lambda ^{t-2}}{1-\beta _{1}\lambda ^{t-1}} = \frac {1-\beta _{1} \lambda ^{t-2}}{1-\beta _{1} \lambda ^{t-1}} \ge \sqrt {1 - \frac {1}{t}}.\tag{12}\end{equation*}
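To get a feel for how small the threshold index can be, one can scan the scalar inequality sqrt(t)/(1 - beta_{1,t}) >= sqrt(t-1)/(1 - beta_{1,t-1}) directly. The sketch below is illustrative only: the helper `last_violation` is hypothetical, the values beta_1 = 0.9 and lambda = 0.001 are assumptions borrowed from the counter-example, and the second schedule is taken as beta_{1,t} = beta_1 / t, the form for which inequality (11) follows from (10).

```python
import math

def last_violation(beta1_schedule, T):
    """Largest t in [2, T] with sqrt(t)/(1-b_t) < sqrt(t-1)/(1-b_{t-1}); 1 if none."""
    t0 = 1
    for t in range(2, T + 1):
        lhs = math.sqrt(t) / (1.0 - beta1_schedule(t))
        rhs = math.sqrt(t - 1) / (1.0 - beta1_schedule(t - 1))
        if lhs < rhs:
            t0 = t   # the inequality fails at this t
    return t0

beta1, lam = 0.9, 0.001   # assumed values, for illustration only

t0_geometric = last_violation(lambda t: beta1 * lam ** (t - 1), 10_000)
t0_harmonic = last_violation(lambda t: beta1 / t, 10_000)
print(t0_geometric, t0_harmonic)
```

For these particular values the inequality fails only at the first few indices, so the extra sum over t up to the threshold in the regret bound stays a small constant.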
Lemma 4.4:
For the parameter settings and conditions assumed in Theorem 4.1, we have \begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T +1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}
Proof:
The proof is almost identical to that of [3, Lemma 2]. Owing to \begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {m^{2}_{t,i}}{\sqrt {t {v}_{t,i}}} \\[-1.6pt]=&\frac {\left[{\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right]^{2}}{\sqrt {(1-\beta _{2})t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}. \tag{13}\end{align*}
Moreover, by Lemma 2.3 we have \begin{align*}&\hspace{-1.5pc}\left({\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right)^{2} \\[-1.6pt]\le&\left ({\sum _{k=1}^{t}(1-\beta _{1,k})^{2}\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)}\right)\left ({\sum _{k=1}^{t}\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}^{2}}\right).\end{align*}
And hence, \begin{align*}&\hspace{-2.8pc}\left({\sum _{k=1}^{t}(1-\beta _{1,k})\left({\prod _{j=k+1}^{t}\beta _{1,j}}\right)g_{k,i}}\right)^{2} \\[-1.6pt]&\qquad \quad \le \, \left ({\sum _{k=1}^{t}\beta _{1}^{t-k}}\right)\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}\right),\tag{14}\end{align*}
\begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}}\right)\left ({\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}\right)}{\sqrt {(1-\beta _{2})t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}\\[-1.6pt]\le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}} \frac {\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {t\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}},\end{align*}
\begin{align*} \frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}}\le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}\sqrt {t}} \frac {\sum _{k=1}^{t}\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {\sum _{k=1}^{t}\beta _{2}^{t-k}g^{2}_{k,i}}}\\[-1.6pt]\le&\frac {1}{(1-\beta _{1})\sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\frac {\beta _{1}^{t-k}g_{k,i}^{2}}{\sqrt {\beta _{2}^{t-k}g^{2}_{k,i}}}\\[-1.6pt]\le&\frac {1}{(1-\beta _{1}) \sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\frac {\beta _{1}^{t-k}}{\sqrt {\beta _{2}^{t-k}}} |{g_{k,i}}|\\[-1.6pt]=&\frac {1}{(1-\beta _{1}) \sqrt {1-\beta _{2}}\sqrt {t}}\sum _{k=1}^{t}\gamma ^{t-k} |{g_{k,i}}|,\end{align*}
\begin{equation*} \sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {1}{(1\!-\!\beta _{1})\sqrt {1\!-\!\beta _{2}}} \sum _{t=1}^{T}\frac {1}{\sqrt {t}}\sum _{k=1}^{t}{\gamma }^{t-k} |{g_{k,i}}|. \tag{15}\end{equation*}
\begin{align*}&\hspace{-1.5pc}\frac {1}{\sqrt {1}}\gamma ^{0}|g_{1, i}|\\&+\,\frac {1}{\sqrt {2}} \biggl (\gamma ^{1}|g_{1, i}| + \gamma ^{0}|{g_{2, i}}| \biggr)\\&+\, \frac {1}{\sqrt {3}} \biggl (\gamma ^{2}|g_{1, i}| + \gamma ^{1}|{g_{2, i}}| + \gamma ^{0}|{g_{3, i}}|\biggr)\\&+\,\cdots \\&+\,\frac {1}{\sqrt {T}} \biggl (\gamma ^{T-1}|g_{1, i}| + \gamma ^{T-2}|{g_{2, i}}| +\ldots + \gamma ^{0}|{g_{T, i}}|\biggr).\end{align*}
\begin{align*}&\hspace{-2.5pc}|g_{1, i}| \left({\gamma ^{0} + \frac {1}{\sqrt {2}}\gamma ^{1} + \frac {1}{\sqrt {3}}\gamma ^{2} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-1}}\right) \\&+\, |{g_{2, i}}| \left({\frac {1}{\sqrt {2}}\gamma ^{0} + \frac {1}{\sqrt {3}}\gamma ^{1} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-2}}\right)\\&+\,|{g_{3, i}}| \left({\frac {1}{\sqrt {3}}\gamma ^{0} + \frac {1}{\sqrt {4}}\gamma ^{1} +\ldots + \frac {1}{\sqrt {T}}\gamma ^{T-3}}\right)\\&+\, \cdots \\&+\,|{g_{T, i}}| \frac {1}{\sqrt {T}}\gamma ^{0}.\end{align*}
\begin{equation*}\sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}| = \sum _{t=1}^{T} |{g_{t, i}}| \sum _{k=t}^{T}\frac {1}{\sqrt {k}}\gamma ^{k-t}\end{equation*}
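The exchange of the order of summation in the identity above can be checked numerically. The following sketch uses hypothetical helper names (`sum_by_step`, `sum_by_gradient`) and an arbitrary random gradient sequence; the value gamma = 0.9 is an assumption chosen only so that 0 < gamma < 1.

```python
import math
import random

def sum_by_step(g, gamma):
    """Left-hand side: group terms by the step t at which they appear."""
    T = len(g)
    return sum(
        (1.0 / math.sqrt(t)) * sum(gamma ** (t - k) * abs(g[k - 1]) for k in range(1, t + 1))
        for t in range(1, T + 1)
    )

def sum_by_gradient(g, gamma):
    """Right-hand side: group terms by the gradient |g_t| they multiply."""
    T = len(g)
    return sum(
        abs(g[t - 1]) * sum(gamma ** (k - t) / math.sqrt(k) for k in range(t, T + 1))
        for t in range(1, T + 1)
    )

rng = random.Random(0)
g = [rng.gauss(0.0, 1.0) for _ in range(200)]   # arbitrary test gradients
gamma = 0.9                                      # assumed value with 0 < gamma < 1

lhs, rhs = sum_by_step(g, gamma), sum_by_gradient(g, gamma)
# Upper bound obtained from 1/sqrt(k) <= 1/sqrt(t) for k >= t and the geometric series.
upper = sum(abs(g[t - 1]) / math.sqrt(t) for t in range(1, len(g) + 1)) / (1.0 - gamma)
print(lhs, rhs, upper)
```

Both groupings enumerate the same set of index pairs (k, t) with k <= t, which is why the two sums agree up to floating-point rounding.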
\begin{align*} \sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}|\le&\sum _{t=1}^{T}|g_{t, i}| \frac {1}{\sqrt {t}}\left ({\frac {1}{1-\gamma }}\right) \\=&\frac {1}{1-\gamma } \sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|.\end{align*}
\begin{align*} \sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|=&\sqrt {\left ({\sum _{t=1}^{T} \frac {1}{\sqrt {t}} |{g_{t, i}}|}\right)^{2}}\\\le&\sqrt {\sum _{t=1}^{T} \frac {1}{t}} \sqrt {\sum _{t=1}^{T}g_{t, i}^{2}} \\\le&(\sqrt {\ln T +1})\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
\begin{equation*}\sum _{t=1}^{T} \frac {1}{\sqrt {t}} \sum _{k=1}^{t} \gamma ^{t-k}|g_{k, i}| \le \frac {\sqrt {\ln T +1}}{1-\gamma } \lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}
\begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T + 1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2},\end{equation*}
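As a sanity check of the bound in Lemma 4.4, one can simulate the moment recursions for a single coordinate and compare both sides. This is a sketch under assumptions: gamma = beta_1 / sqrt(beta_2) is read off the last step of the derivation above, the schedule beta_{1,t} = beta_1 * lambda^(t-1) with lambda = 0.99 and the Gaussian gradients are arbitrary choices, and the function name `lemma44_sides` is hypothetical.

```python
import math
import random

def lemma44_sides(g, beta1=0.9, lam=0.99, beta2=0.999):
    """Return (lhs, rhs) of the Lemma 4.4 bound for one coordinate's gradients g."""
    T = len(g)
    gamma = beta1 / math.sqrt(beta2)           # assumed definition of gamma; must be < 1
    m = v = v_hat = 0.0
    lhs = 0.0
    for t, g_t in enumerate(g, start=1):
        beta1_t = beta1 * lam ** (t - 1)
        m = beta1_t * m + (1.0 - beta1_t) * g_t
        v = beta2 * v + (1.0 - beta2) * g_t * g_t
        v_hat = max(v_hat, v)                  # AMSGrad running maximum
        lhs += m * m / math.sqrt(t * v_hat)
    g_norm = math.sqrt(sum(x * x for x in g))  # ||g_{1:T,i}||_2
    rhs = (math.sqrt(math.log(T) + 1.0)
           / ((1.0 - beta1) * math.sqrt(1.0 - beta2) * (1.0 - gamma))) * g_norm
    return lhs, rhs

rng = random.Random(0)
grads = [rng.gauss(0.0, 1.0) for _ in range(500)]
lhs, rhs = lemma44_sides(grads)
print(lhs, rhs)
```

In runs of this kind the right-hand side is typically far larger than the left-hand side, which reflects how loose the constant factors in the bound are.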
Let us now prove Theorem 4.1.
Proof of Theorem 4.1:
To prove Theorem 4.1, by Lemma 3.1, we need to bound the terms (6), (7), and (8). First, we consider (7). We have \begin{align*} \sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}}=&\frac {\alpha }{1-\beta _{1}}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac { m_{t,i}^{2}}{\sqrt {t\hat v_{t,i}}} \\[4.5pt]\le&\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \\[4.5pt]&\times \, \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}, \tag{16}\end{align*}
\begin{align*}&\hspace{-2.8pc}\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2} \\[4.5pt]=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1}\lambda ^{t-1}\sqrt {(t-1)}\sqrt {\hat {v}_{t-1,i}} }{2\alpha (1-\beta _{1})}(x_{t,i} - {x^{*}_{i}})^{2} \\[4.5pt]\le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\sqrt {(t-1)} \lambda ^{t-1} \\[4.5pt]\le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}t \lambda ^{t-1} \\[4.5pt]\le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \frac {1}{(1-\lambda)^{2}} \\[4.5pt]=&\frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}}, \tag{17}\end{align*}
\begin{align*}&\hspace{-3.5pc}\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1})} (x_{t,i} - {x^{*}_{i}})^{2} \\=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1}\sqrt {(t-1)}\sqrt {\hat {v}_{t-1,i}} }{2\alpha (1-\beta _{1})t}(x_{t,i} - {x^{*}_{i}})^{2} \\\le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\sqrt {(t-1)}}{t} \\\le&\frac {D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\frac {1}{\sqrt {t}} \\\le&\frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})}, \tag{18}\end{align*}
Finally, we will bound (6). From (9) and replacing \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha (1-\beta _{1})} (x_{1, i} \!-\! {x^{*}_{i}})^{2} \\&+\,\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1\!-\!\beta _{1,t}} \!-\! \frac {\sqrt {(t\!-\!1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right).\end{align*}
By Lemma 4.3, there is some \begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha _{1}(1-\beta _{1,1})} (x_{1, i} - {x^{*}_{i}})^{2} \\[-1.5pt]&+\,\!\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}}\! -\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\[-1.5pt]&+\,\!\frac {1}{2\alpha }\!\!\sum _{i=1}^{d}\!\! \sum _{t=t_{0}+1}^{T} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}}\right)\\[-1.5pt]\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} \\[-1.5pt]&+\,\!\frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\[-1.5pt]&+\,\!\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=t_{0}+1}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right).\end{align*}
Since \begin{align*}&\hspace{-2.8pc}\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=t_{0}+1}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\=&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} - \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {t_{0}\hat {v}_{t_{0},i}}}{1-\beta _{1, t_{0}}}\\\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}},\end{align*}
\begin{align*} (6)\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} + \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d}\frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} \\[-1.5pt]&+\, \frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}}\! (x_{t, i} \!-\! {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right) \\[-1.5pt]\le&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1,1}} + \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d}\frac { \sqrt {T\hat {v}_{T,i}}}{1-\beta _{1, T}} \\[-1.5pt]&+\, \frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{t_{0}} \frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \\[-1.5pt]\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t}+ \sqrt {T}}\right),\tag{19}\end{align*}
\begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\[-1.5pt]&+\, \frac {d D_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})(1-\lambda)^{2}} \\[-1.5pt]&+\,\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}
If \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})}\left ({\sum _{t=1}^{t_{0}} \sqrt {t} + \sqrt {T}}\right)\\[-1.5pt]&+\, \frac {d D_{\infty }^{2}G_{\infty }\sqrt {T}}{\alpha (1-\beta _{1})} \\[-1.5pt]&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
The following corollary shows that, when either
Corollary 4.5:
With the same assumption as in Theorem 4.1, AMSGrad achieves the following guarantee:\begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T} = 0.\end{equation*}
Proof:
The result is obtained by using Theorem 4.1 and the following fact:\begin{align*} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}=&\sum _{i=1}^{d}\sqrt {g_{1,i}^{2} + g_{2,i}^{2} + \cdots + g_{T,i}^{2}}\\[-1.5pt]\le&\sum _{i=1}^{d}\sqrt {TG_{\infty }^{2}}\\[-1.5pt]=&dG_{\infty }\sqrt {T},\end{align*}
New Version of AMSGrad Optimizer: AdamX
Let
With Algorithm 2, the regret is bounded as follows.
AdamX: A New Variant of Adam and AMSGrad
Set
for
\begin{equation*} \hat v_{t} = \begin{cases} v_{t} & \text {if } t = 1\\ \max \left\{{\dfrac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1}, v_{t}}\right\} & \text {if } t\ge 2 \end{cases}\end{equation*}
where
end for
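The modified second-moment update of Algorithm 2 can be sketched for a single coordinate as follows. This is a non-authoritative illustration: only the update of the rescaled maximum is taken from the algorithm as displayed; the moment recursions, the step size alpha_t = alpha / sqrt(t), the interval [-1, 1], and the alternating schedule for beta_{1,t} are assumptions carried over from the AMSGrad setting, and the name `adamx_1d` is hypothetical.

```python
import math

def adamx_1d(grads, beta1_schedule, x1=1.0, alpha=0.001, beta2=0.999,
             lo=-1.0, hi=1.0):
    """1-D AdamX sketch; returns the iterates and the sequence of v_hat values."""
    x, m, v, v_hat = x1, 0.0, 0.0, 0.0
    xs, v_hats = [x], []
    for t, g in enumerate(grads, start=1):
        b_t = beta1_schedule(t)
        m = b_t * m + (1.0 - b_t) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        if t == 1:
            v_hat = v
        else:
            # AdamX: rescale the previous maximum before comparing with v_t.
            ratio = (1.0 - b_t) / (1.0 - beta1_schedule(t - 1))
            v_hat = max(ratio * ratio * v_hat, v)
        # Assumed projected step (clipping is the 1-D projection).
        x = min(hi, max(lo, x - (alpha / math.sqrt(t)) * m / math.sqrt(v_hat)))
        xs.append(x)
        v_hats.append(v_hat)
    return xs, v_hats

# A non-monotone schedule with beta_{1,t} <= beta_1 = 0.9 (assumed, for illustration).
schedule = lambda t: 0.9 if t % 2 == 1 else 0.5
grads = [1010.0 if t % 101 == 1 else -10.0 for t in range(1, 301)]
xs, v_hats = adamx_1d(grads, schedule)

# Quantity whose monotonicity drives the telescoping argument.
q = [math.sqrt(t * v_hats[t - 1]) / (1.0 - schedule(t)) for t in range(1, 301)]
print(all(q[t] >= q[t - 1] for t in range(1, 300)))
```

The rescaling is what keeps sqrt(t * v_hat_t) / (1 - beta_{1,t}) non-decreasing even when beta_{1,t} itself is not monotone, which plain AMSGrad does not guarantee.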
Theorem 5.1:
Let \begin{align*} R(T)\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}}\sqrt {T} \\&+\, \frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)} \\&+\, \frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2}.\end{align*}
To prove Theorem 5.1, we need the following Lemmas 5.2, 5.3, and 5.4.
Lemma 5.2:
For all \begin{equation*} \hat {v}_{t} = \max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le t},\tag{20}\end{equation*}
Proof:
We will prove (20) by induction on \begin{align*} \hat v_{2}\overset{\Delta }{=}&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,1})^{2}}\hat v_{1}, v_{2}}\right \}\\=&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,1})^{2}} v_{1}, v_{2}}\right \}\\=&\max \left \{{\frac {(1-\beta _{1,2})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le 2}.\end{align*}
\begin{equation*}\hat {v}_{t-1} = \max \left \{{\frac {(1-\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right \}_{1\le s\le t-1}\end{equation*}
\begin{equation*}\hat {v}_{t}\overset {\Delta }{=} \max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat {v}_{t-1}, v_{t}}\right \},\end{equation*}
\begin{align*}&\hspace{-1.2pc}\hat {v}_{t} \\=&\max \left \{{\frac {(1-\beta _{1,t})^{2}}{(1\!-\!\beta _{1,t-1})^{2}}\left ({\max \left\{{\frac {(1\!-\!\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}{v}_{s}}\right\}_{1\le s\le t-1}}\right),\! v_{t}}\right \} \\=&\max \Bigg \{ \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\frac {(1-\beta _{1,t-1})^{2}}{(1-\beta _{1,s})^{2}}{v}_{s}}\right\}_{1\le s\le t-1}, \\&\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t})^{2}}v_{t}\Bigg \} \\=&\max \left\{{ \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}} {v}_{s}}\right\}_{1\le s\le t-1}, \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t})^{2}}v_{t}}\right\} \\=&\max \left\{{ \frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}} {v}_{s}}\right\}_{1\le s\le t},\end{align*}
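The closed form of Lemma 5.2, with s ranging over 1, ..., t as it is used in the proof of Lemma 5.3, can be compared against the recursive definition on random inputs. The sketch below is illustrative: the function names are hypothetical, and the sequences of v_s and beta_{1,s} are arbitrary positive numbers and arbitrary values in [0, 0.9], not outputs of an actual optimizer run.

```python
import math
import random

def v_hat_recursive(v, beta):
    """v_hat_1 = v_1; v_hat_t = max(((1-b_t)/(1-b_{t-1}))^2 * v_hat_{t-1}, v_t)."""
    out = [v[0]]
    for t in range(1, len(v)):                 # 0-based index t is step t+1
        ratio = (1.0 - beta[t]) / (1.0 - beta[t - 1])
        out.append(max(ratio * ratio * out[-1], v[t]))
    return out

def v_hat_closed(v, beta, t):
    """Closed form for 1-based t: max over 1 <= s <= t of ((1-b_t)/(1-b_s))^2 * v_s."""
    return max(
        ((1.0 - beta[t - 1]) / (1.0 - beta[s - 1])) ** 2 * v[s - 1]
        for s in range(1, t + 1)
    )

rng = random.Random(0)
v = [rng.uniform(0.1, 10.0) for _ in range(50)]      # arbitrary positive v_s
beta = [rng.uniform(0.0, 0.9) for _ in range(50)]    # arbitrary beta_{1,s} in [0, 0.9]

rec = v_hat_recursive(v, beta)
closed = [v_hat_closed(v, beta, t) for t in range(1, 51)]
print(max(abs(a - b) for a, b in zip(rec, closed)))
```

The term with s = t has ratio 1 and contributes v_t itself, which is why the maximum must include s = t.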
Lemma 5.3:
For all
Proof:
By Lemma 5.2, \begin{equation*}\hat {v}_{t} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right\}_{1\le s\le t}.\end{equation*}
\begin{align*} \sqrt {\hat {v}_{t}}=&\sqrt {\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}} \\=&\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}g^{2}_{k}}\\\le&\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k} (\max _{1\le k\le s}{|g_{k}|})^{2}}\\=&G_{\infty }\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\sqrt {\sum _{k=1}^{s}\beta _{2}^{s-k}}\\\le&G_{\infty }\sqrt {1-\beta _{2}}\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)\frac {1}{\sqrt {1-\beta _{2}}}\\=&\left ({\frac {1-\beta _{1,t}}{1-\beta _{1,s}}}\right)G_{\infty }\\\le&\frac {G_{\infty }}{1-\beta _{1}},\end{align*}
Lemma 5.4:
For the parameter settings and conditions assumed in Theorem 5.1, we have \begin{equation*}\sum _{t=1}^{T}\frac {m^{2}_{t,i}}{\sqrt {t\hat {v}_{t,i}}} \le \frac {\sqrt { \ln T +1} }{(1-\beta _{1})\sqrt {1-\beta _{2}}(1-\gamma)}\lVert {g_{1:T, i}}\rVert _{2}.\end{equation*}
Proof:
Since for all \begin{equation*}\hat {v}_{t,i} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,s})^{2}}v_{s}}\right\}_{1\le s\le t},\end{equation*}
Proof of Theorem 5.1:
Similarly to the proof of Theorem 4.1, we need to bound (6), (7), and (8). By using Lemma 5.4, we obtain the same bound for (7) as in the proof of Theorem 4.1, that is, \begin{align*} (7)=&\sum _{i=1}^{d} \sum _{t=1}^{T} \frac {\alpha _{t}}{1-\beta _{1}} \frac { m_{t,i}^{2}}{\sqrt {\hat v_{t,i}}} \\=&\frac {\alpha }{1-\beta _{1}}\sum _{i=1}^{d} \sum _{t=1}^{T} \frac { m_{t,i}^{2}}{\sqrt {t\hat v_{t,i}}}\\\le&\frac {\alpha \sqrt { \ln T +1}}{(1-\beta _{1})^{2}\sqrt {1-\beta _{2}}(1-\gamma)} \sum _{i=1}^{d}\lVert {g_{1:T, i}}\rVert _{2},\end{align*}
\begin{align*} (8)=&\sum _{i=1}^{d} \sum _{t=2}^{T}\frac {\beta _{1,t}\sqrt {\hat {v}_{t-1,i}}}{2\alpha _{t-1}(1-\beta _{1,t})} (x_{t,i} - {x^{*}_{i}})^{2}\\\le&\frac {D_{\infty }^{2}}{2\alpha (1-\beta _{1})} \sum _{i=1}^{d} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)\hat {v}_{t-1,i}}.\end{align*}
\begin{equation*} (8) \le \frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {(t-1)}.\end{equation*}
\begin{align*} (6)\le&\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{2\alpha (1-\beta _{1})} (x_{1, i} - {x^{*}_{i}})^{2} \\&+\,\! \frac {1}{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} (x_{t, i} - {x^{*}_{i}})^{2} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} \!-\! \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\end{align*}
\begin{equation*}\hat v_{t,i} = \max \left\{{\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1,i}, v_{t,i}}\right\}.\end{equation*}
\begin{align*}&\hspace{-2.8pc}\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} \\\ge&\frac {\sqrt {t\frac {(1-\beta _{1,t})^{2}}{(1-\beta _{1,t-1})^{2}}\hat v_{t-1,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}}\\=&\frac {\sqrt {t \hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} \ge 0.\end{align*}
\begin{equation*}\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}},\end{equation*}
\begin{align*} (6)\le&\frac {D_{\infty }^{2} }{2\alpha }\sum _{i=1}^{d} \frac { \sqrt {\hat {v}_{1,i}}}{1-\beta _{1}} \\+&\frac {D_{\infty }^{2} }{2\alpha }\sum _{i=1}^{d} \sum _{t=2}^{T} \left ({\frac {\sqrt {t \hat {v}_{t,i}}}{1-\beta _{1,t}} - \frac {\sqrt {(t-1)\hat {v}_{t-1,i}}}{1-\beta _{1,t-1}} }\right)\\=&\frac {D_{\infty }^{2}}{2\alpha }\sum _{i=1}^{d} \frac {\sqrt {T\hat {v}_{T,i}} }{1-\beta _{1,T}}\\\le&\frac {dD_{\infty }^{2}G_{\infty }}{2\alpha (1-\beta _{1})^{2}}\sqrt {T}~,\end{align*}
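The telescoping argument above relies on the sequence $a_t = \sqrt{t\hat v_{t,i}}/(1-\beta_{1,t})$ being non-decreasing. The sketch below illustrates this on a one-dimensional toy problem. It is not the authors' implementation: the moment updates are assumed to be the standard Adam ones, and the schedule $\beta_{1,t}=\beta_1\lambda^{t-1}$, the step size $\alpha_t=\alpha/\sqrt{t}$ with $\alpha = 0.1$, the feasible interval $[-1,1]$, the loss $(x-0.5)^2$, the small constant `eps`, and the function name `adamx_1d` are all illustrative assumptions.

```python
import math

def adamx_1d(grad_fn, x0, T, alpha=0.1, beta1=0.9, lam=0.99, beta2=0.999,
             lo=-1.0, hi=1.0, eps=1e-12):
    """Run a 1-D AdamX-style loop (hypothetical helper, illustrative defaults).

    Returns (a, sqrt_v_hat, g_max), where
    a[t-1] = sqrt(t * v_hat_t) / (1 - beta_{1,t}) and g_max = max_t |g_t|.
    """
    x, m, v, v_hat, b_prev = x0, 0.0, 0.0, 0.0, None
    a, sqrt_v_hat, g_max = [], [], 0.0
    for t in range(1, T + 1):
        g = grad_fn(x)
        g_max = max(g_max, abs(g))
        b = beta1 * lam ** (t - 1)              # assumed schedule for beta_{1,t}
        m = b * m + (1 - b) * g                 # first moment
        v = beta2 * v + (1 - beta2) * g * g     # second moment
        if t == 1:
            v_hat = v                           # assumed initialization
        else:
            v_hat = max((1 - b) ** 2 / (1 - b_prev) ** 2 * v_hat, v)
        a.append(math.sqrt(t * v_hat) / (1 - b))
        sqrt_v_hat.append(math.sqrt(v_hat))
        # Step size alpha / sqrt(t); eps is not part of the analysis and
        # only guards against division by zero.
        x = x - (alpha / math.sqrt(t)) * m / (math.sqrt(v_hat) + eps)
        x = min(hi, max(lo, x))                 # projection onto [lo, hi]
        b_prev = b
    return a, sqrt_v_hat, g_max

# Toy loss f(x) = (x - 0.5)^2 with gradient 2 * (x - 0.5).
a, sqrt_v_hat, g_max = adamx_1d(lambda x: 2.0 * (x - 0.5), x0=-1.0, T=500)
# Telescoping: a_1 + sum_{t>=2} (a_t - a_{t-1}) collapses to a_T.
telescoped = a[0] + sum(a[t] - a[t - 1] for t in range(1, len(a)))
print(telescoped, a[-1], max(sqrt_v_hat), g_max / (1 - 0.9))
```

In this run every difference $a_t - a_{t-1}$ is non-negative, the telescoped sum matches $a_T$ up to rounding, and $\sqrt{\hat v_t}$ stays below $G_\infty/(1-\beta_1)$ with $G_\infty$ taken as the largest observed gradient magnitude.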
Corollary 5.5:
With the same assumptions as in Theorem 5.1, and for every choice of $\beta_{1,t}$ such that \begin{equation*}\lim _{T\to \infty }\frac {\sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}}{T} = 0,\end{equation*}
we have \begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T}=0.\end{equation*}
Proof:
By Theorem 5.1, it is sufficient to consider the term \begin{equation*}\frac {dD_{\infty }^{2}~G_{\infty }}{2\alpha (1-\beta _{1})^{2}} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}\end{equation*}
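When either $\beta_{1,t} = \beta_{1}\lambda^{t-1}$ for some $\lambda \in (0,1)$ or $\beta_{1,t} = 1/t$, the condition of Corollary 5.5 is satisfied, which gives the following corollary.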
Corollary 5.6:
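With the same assumptions as in Theorem 5.1, and either $\beta_{1,t} = \beta_{1}\lambda^{t-1}$ for some $\lambda \in (0,1)$ or $\beta_{1,t} = 1/t$, we have \begin{equation*}\lim _{T\to \infty } \frac {R(T)}{T} = 0.\end{equation*}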
Proof:
By Corollary 5.5, it is sufficient to consider the term \begin{equation*}\sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}.\end{equation*}
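If $\beta_{1,t} = \beta_{1}\lambda^{t-1}$ with $\lambda \in (0,1)$, then \begin{align*} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}=&\sum _{t=2}^{T}\beta _{1}\lambda ^{t-1}\sqrt {t-1} \\\le&\sum _{t=2}^{T}\sqrt {(t-1)} \lambda ^{t-1} \\\le&\sum _{t=2}^{T}t \lambda ^{t-1} \\\le&\frac {1}{(1-\lambda)^{2}},\tag{21}\end{align*}
which is bounded independently of $T$. If $\beta_{1,t} = 1/t$, then \begin{align*} \sum _{t=2}^{T}\beta _{1,t}\sqrt {t-1}=&\sum _{t=2}^{T}\frac {\sqrt {t-1}}{t} \\\le&\sum _{t=2}^{T}\frac {1}{\sqrt {t}} \\\le&2\sqrt {T},\tag{22}\end{align*} so in both cases $\frac{1}{T}\sum_{t=2}^{T}\beta_{1,t}\sqrt{t-1} \to 0$ as $T \to \infty$.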
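The bounds (21) and (22) can be illustrated numerically. The sketch below evaluates $\sum_{t=2}^{T}\beta_{1,t}\sqrt{t-1}$ for both schedules and prints the sum divided by $T$. The values $\beta_1 = \lambda = 0.9$, the chosen horizons, and the function names are assumptions made for illustration.

```python
import math

def weighted_sum(beta1_t, T):
    """Compute sum_{t=2}^{T} beta_{1,t} * sqrt(t - 1)."""
    return sum(beta1_t(t) * math.sqrt(t - 1) for t in range(2, T + 1))

beta1, lam = 0.9, 0.9  # illustrative values with lam in (0, 1)

def geometric(t):
    """Schedule beta_{1,t} = beta1 * lam**(t-1), the case of (21)."""
    return beta1 * lam ** (t - 1)

def harmonic(t):
    """Schedule beta_{1,t} = 1/t, the case of (22)."""
    return 1.0 / t

for T in (10, 100, 1000, 10000):
    s_geo = weighted_sum(geometric, T)
    s_har = weighted_sum(harmonic, T)
    # Columns: T, (sum / T) for the geometric and the harmonic schedule.
    print(T, s_geo / T, s_har / T)
```

Both printed ratios shrink as $T$ grows: the geometric sum stays below $1/(1-\lambda)^2$, and the harmonic sum grows no faster than $2\sqrt{T}$.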
Experiments
While we regard the theoretical analyses of AMSGrad and AdamX in the previous sections as our main contributions, this section reports experimental results for both optimizers. Concretely, we use the PyTorch code for AMSGrad by setting the Boolean flag
The learning rate is scheduled for both optimizers, AMSGrad and AdamX, as follows: $10^{-3}$, $10^{-4}$, $10^{-5}$, $10^{-6}$,
Testing accuracies over CIFAR-10 using AMSGrad and AdamX, with different neural network models.
Conclusion
We have shown that the convergence proof of AMSGrad [3] is problematic, and presented various fixes for it, which include a new and slightly modified version called AdamX. Our work helps ensure the theoretical foundation of those optimizers.