Accelerated and Deep Expectation Maximization for One-Bit MIMO-OFDM Detection

In this paper we study the expectation maximization (EM) technique for one-bit MIMO-OFDM detection (OMOD). Arising from the recent interest in massive MIMO with one-bit analog-to-digital converters, OMOD is a massive-scale problem. EM is an iterative method that can exploit the OFDM structure to process the problem in a per-iteration efficient fashion. In this study we analyze the convergence rate of EM for a class of approximate maximum-likelihood OMOD formulations, or, in a broader sense, a class of problems involving regression from quantized data. We show how the SNR and channel conditions can affect the convergence rate. We do so by making a connection between the EM and proximal gradient methods in the context of OMOD. This connection also gives us insight for building new accelerated and/or inexact EM schemes. The accelerated scheme has faster convergence in theory, and the inexact scheme provides us with the flexibility to implement EM more efficiently, with a convergence guarantee. Furthermore, we develop a deep EM algorithm, wherein we take the structure of our inexact EM algorithm and apply deep unfolding to train an efficient structured deep net. Simulation results show that our accelerated exact/inexact EM algorithms run much faster than their standard EM counterparts, and that the deep EM algorithm gives promising detection and runtime performance.


Introduction
In massive multiple-input multiple-output (MIMO), we have recently seen enormous interest in techniques related to low-resolution or coarsely quantized (CQ) signals from a fundamental signal processing viewpoint. This trend is driven by our interest in replacing high-resolution analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) with lower-resolution ones, so that we can reduce the hardware cost and energy consumption of the radio-frequency (RF) front ends.

The Problem of Interest: CQ MIMO-OFDM Detection
In MIMO detection it is common to assume frequency-flat channels. Yet, in current mobile systems, we must consider frequency-selective channels. This is not a problem in the high-resolution scenario, because current systems usually use the orthogonal frequency division multiplexing (OFDM) scheme, and OFDM breaks the frequency-selective channels into many parallel frequency-flat channels. This means that the MIMO detection methods for frequency-flat channels can be directly applied. Unfortunately, the same does not hold for the CQ case. The presence of quantization destroys the structure required to turn the problem into parallel frequency-flat channels. Consequently, when we consider MIMO-OFDM detection with low-resolution ADCs, we have to engage with the full form of the problem. The arising difficulty is that it is a massive-scale problem; the problem size scales with the OFDM size, which is of the order of hundreds. While in principle we can directly apply the CQ MIMO detection methods for the frequency-flat channel case to the MIMO-OFDM case, it is computationally infeasible to do so due to the massive problem size. It is necessary to exploit the OFDM structure to efficiently realize MIMO detection methods.
There are far fewer studies on CQ MIMO-OFDM detection [10, 17-19]. In [18], the PG method for the box ML relaxation formulation was developed. The PG method is able to exploit the OFDM structure to have a per-iteration efficient implementation. In [10, 19], the EM method for the Gaussian maximum a posteriori formulation was built. The EM method is iterative, and at each iteration we solve a problem (the so-called M-step problem) that takes the same form as MIMO-OFDM without quantization. Hence we can exploit the OFDM structure to have a per-iteration efficient implementation, and this makes the EM method appealing in OMOD. The same benefit was also seen in the application of the variational Bayes method (which is related to EM) to the CQ OMOD problem. The previously mentioned result of high-SNR slow convergence is unique to our problem. In terms of the analysis techniques used, our analysis is considered closest to Mairal's study [34], which considers a general majorization minimization (MM) framework and can cover EM. Mairal's study takes insight from the PG analysis techniques through some assumptions. We also draw insight from the PG analysis techniques, but we do so by best using our problem's specific structure. As we will discuss in detail, our result arguably provides a better characterization of the convergence behaviors.
One key result in this study is accelerated EM for OMOD. Indeed, Mairal's study [34] has described the same kind of acceleration for MM. To the best of our knowledge, however, no application was provided in Mairal's study or in subsequent studies. Our application is probably the first, showing the efficiency of such accelerated EM or MM. We should also discuss the application of deep unfolding. While deep unfolding has been widely considered in MIMO detection, it has not been used in CQ MIMO-OFDM detection.
The preliminary version of this study was presented at a conference [36]. It studied inexact EM as a working idea, and it did not consider EM convergence analysis, accelerated EM or deep EM, which are the main contributions of this paper.

Organization and Notations
This paper is organized as follows. Sections 2, 3 and 4 review the EM basics, the MIMO-OFDM detection problem and the OMOD problem, respectively. In Section 5, we perform EM convergence analysis and establish the accelerated and inexact EM methods. Section 6 develops an accelerated inexact EM algorithm for OMOD. Section 7 builds a deep inexact EM algorithm via deep unfolding. Section 8 provides the numerical results, and we conclude in Section 9. Some basic notations are described as follows. The sets of real, non-negative, non-positive and complex numbers are denoted by R, R+, R- and C, respectively; we denote j = √-1; boldface lowercase letters, e.g., x, represent column vectors; the ith element of a vector x is denoted by x_i; boldface capital letters, e.g., X, represent matrices; the superscripts ⊤, H, -1 and † denote the transpose, Hermitian transpose, inverse and pseudoinverse, respectively; ℜ(x) and ℑ(x) denote the real and imaginary parts of x, respectively; x = (x_1, . . ., x_m) means that x = [x_1⊤, . . ., x_m⊤]⊤; ∥ · ∥ denotes the Euclidean norm; 0 and I denote an all-zero vector and an identity matrix, respectively; Diag(x) denotes a diagonal matrix whose (i, i)th element is x_i; x ∼ D means that x is a random vector following a probability distribution D; N(µ, Σ) denotes a multivariate Gaussian distribution with mean µ and covariance Σ; N(x; µ, Σ) denotes the distribution function corresponding to N(µ, Σ).

A Review of EM for Quantized Regression
Before we delve into the details of OMOD, we would like to first review the classic concepts of expectation maximization (EM) [37] in the context of quantized linear regression (QLR). In particular, the arising notion of iterative linear regression is worth appreciating, and we will need some of the details later.
We consider a basic QLR problem for which we have an observation y ∈ {±1}^m following the model y = sgn(x), x = Aθ + v, where sgn : R^m → {±1}^m is the element-wise sign function; A ∈ R^{m×n} is known; θ ∈ R^n is a model parameter; v ∼ N(0, σ²I) is noise; and σ² is known. Note that x is unobserved. The problem is to estimate θ from y. As we will see later, OMOD can be viewed as an instance of QLR, with A and θ having certain structures. We consider the maximum-likelihood (ML) estimation θ̂ ∈ arg max_{θ∈R^n} L(θ) := log p(y|θ), where p(y|θ) is the distribution of y given θ; the same notation will be used to denote the distributions of other random variables. It can be shown that p(y|θ) = ∏_{i=1}^m p(y_i|θ), with p(y_i|θ) = Φ(y_i a_i⊤θ/σ), where a_i denotes the ith row of A and Φ denotes the standard Gaussian cumulative distribution function [11, 38].
The ML estimation problem (2) has no closed-form solution in general, and EM is a popular way to solve the ML problem. Intuitively, the idea is to use the fact that estimating θ is easy if x is observable. To describe the EM method, let G(θ|θ′) be the EM surrogate function constructed from the conditional expectation E_x[f(x)|y, θ′] = ∫_X f(x) p(x|y, θ′) dx, where X denotes the support of p(x|y, θ′). It is known that G(θ|θ′) ≤ L(θ) for all θ, θ′, and G(θ′|θ′) = L(θ′) for all θ′; i.e., G(θ|θ′) is a lower bound of L(θ). EM considers a successive lower-bound approximation: at iteration k, it maximizes G(θ|θ^k) to obtain θ^{k+1}. An important question is whether this maximization is tractable. It can be shown that G(θ|θ^k) reduces to a linear regression objective plus a constant ("const.") that does not depend on θ, with the conditional mean expressed in terms of φ(t) = e^{-t²/2}/√(2π); see, e.g., [29], for details. To prove (6), one needs to show that p(x_i|y_i, θ) is a truncated Gaussian distribution, namely, p(x_i|y_i, θ) = p(x_i|θ)/p(y_i|θ), with X(y_i) being the support; that E_{x_i}[x_i|y, θ′] is the mean of p(x_i|y_i, θ′); and that a truncated Gaussian mean has an explicit expression [29]. The EM method can hence be written as in (7), with x^k given by (6). The result is an iterative linear regression: we estimate x by the conditional mean x^k, inferred from y and the previous estimate θ^k; and then we estimate θ by the linear regression in (7), which is easy to solve. This iterative linear regression process looks appealing and makes sense. Some history should be noted. The EM method for QLR dates back to as early as 1980 in statistics, with application to blood lead data [39]. It also arose in signal processing, with applications to MIMO channel estimation and detection from coarsely quantized signals [10, 19, 29, 30].
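To make the iterative linear regression concrete, here is a minimal numerical sketch of EM for the sign-quantized model above. It assumes the probit likelihood p(y_i|θ) = Φ(y_i a_i⊤θ/σ) and the truncated-Gaussian conditional mean; the sizes, the seed and the M-step solver (a precomputed pseudoinverse) are illustrative choices, not the paper's exact setup.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, n, sigma = 200, 4, 0.5
A = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
y = np.sign(A @ theta_true + sigma * rng.standard_normal(m))  # one-bit observations

theta = np.zeros(n)
A_pinv = np.linalg.pinv(A)          # the M-step is a plain least-squares fit
for _ in range(50):
    mu = A @ theta                  # mean of the unquantized x under the current estimate
    z = y * mu / sigma
    # E-step: truncated-Gaussian conditional mean E[x_i | y_i, theta]
    x_bar = mu + y * sigma * norm.pdf(z) / norm.cdf(z)
    theta = A_pinv @ x_bar          # M-step: linear regression on the imputed x_bar
```

Since A does not change across iterations, the pseudoinverse is computed once and each EM iteration reduces to two matrix-vector products plus element-wise Gaussian evaluations.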

A Review of the MIMO-OFDM Background
In this section, we review some background on MIMO-OFDM detection. Section 3.1 describes the MIMO-OFDM signal model. Section 3.2 shows the subcarrier decoupling concept, which, as will be seen later, plays a pivotal role in the application of EM to OMOD.

MIMO-OFDM Model
Consider a wireless uplink scenario where a number of users simultaneously send signals to a multiantenna base station (BS). The transmitted signals undergo frequency-selective channel effects upon reception at the BS, and the user side employs the OFDM transmission scheme. The received signals at the BS can be modeled as in (8), where r_m ∈ C^W is a block of discrete-time signals received by the BS's mth antenna; W is the block length, or the OFDM size; M is the number of antennas at the BS; F ∈ C^{W×W} is the unitary discrete Fourier transform (DFT) matrix; s_n ∈ C^W is a block of symbols sent by user n; N is the number of users; H_{m,n} ∈ C^{W×W} is a circulant matrix, constructed from a discrete-time impulse response h_{m,n} = (h_{0,m,n}, . . ., h_{W′-1,m,n}, 0, . . ., 0) ∈ C^W of the channel from user n to the BS's mth antenna, with W′ ≪ W; and ν_m is circular complex Gaussian noise with mean 0 and covariance σ²_C I, with ν_1, . . ., ν_M being independent. The symbols s_{w,n} are drawn from a QAM constellation set parameterized by a positive integer D. The above model is standard, and we refer the reader to the literature (e.g., [40]) for details. For conciseness, we let r = (r_1, . . ., r_M) and express (8) as in (9), where s = (s_1, . . ., s_N) and ν = (ν_1, . . ., ν_M).

Subcarrier Decoupling in Unquantized MIMO-OFDM
Consider the problem of ML detection of the symbol vector s from the received signal r, given knowledge of the channel H. It is well known that the ML detection problem can be formulated as the minimization in (11). At first look, this is a massive-scale problem; the OFDM size W is often of the order of hundreds. But it is widely known that (11) can be decoupled into many smaller problems. Since H_{m,n} is circulant, it admits an eigendecomposition by the DFT, with eigenvalues given by ĥ_{m,n} = F h_{m,n}. Let r̂_m = F r_m. One can show that (11) separates across subcarriers, where ř_w = (r̂_{1,w}, . . ., r̂_{M,w}) ∈ C^M, Ȟ_w = [ĥ_{w,m,n}]_{m,n} ∈ C^{M×N} and š_w = (s_{w,1}, . . ., s_{w,N}) ∈ C^N are the received signal, MIMO channel response and symbol vector at subcarrier w, respectively. We can hence rewrite (11) as (13). We see that each problem in (13) has a much smaller size and can be dealt with independently. The decomposition of (11) into (13) will be called subcarrier decoupling in what follows. In MIMO-OFDM detection, subcarrier decoupling is commonly used to break the problem into many parallel smaller-size MIMO detection problems.
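The circulant-diagonalization fact that underlies subcarrier decoupling is easy to check numerically. The sketch below, with arbitrary sizes, verifies that a circulant channel matrix built from a short impulse response is diagonalized by the unitary DFT matrix, with the unnormalized DFT of the impulse response on the diagonal; note the √W scaling relative to the unitary F, which the paper's notation may absorb differently.

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(1)
W = 8
h = np.zeros(W, dtype=complex)
h[:3] = rng.standard_normal(3) + 1j * rng.standard_normal(3)  # short impulse response (W' << W)
H = circulant(h)                        # circulant channel matrix with first column h

F = np.fft.fft(np.eye(W)) / np.sqrt(W)  # unitary DFT matrix
D = F @ H @ F.conj().T                  # becomes diagonal in the DFT basis
eig = np.fft.fft(h)                     # eigenvalues: the unnormalized DFT of h
```

This is why multiplying by H reduces to one FFT, one element-wise scaling and one IFFT per antenna-user pair, which is the computational backbone of subcarrier decoupling.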

OMOD Problem Statement and Prior Study
In this section, we describe the background of OMOD. Sections 4.1 and 4.2 provide the problem formulation and statement. Sections 4.3 and 4.4 review the existing methods based on proximal gradient and EM, respectively.

One-Bit MIMO-OFDM Detection (OMOD)
OMOD considers the same MIMO-OFDM scenario as in Section 3.1, but with one-bit ADCs. In the MIMO-OFDM detection in Section 3, we assume that signal acquisitions at the BS have negligible quantization errors; to put it another way, the BS is equipped with high-resolution ADCs. For the one-bit ADC case, the acquired signals are q_m = sgn(ℜ(r_m)) + j sgn(ℑ(r_m)), where sgn is applied element-wise and the r_m's are the unquantized MIMO-OFDM signals in (8). As mentioned in the Introduction, the one-bit ADC case is motivated by the massive MIMO scenario where we want to use cheap ADCs to reduce the hardware cost and energy consumption of the RF front ends.
The OMOD problem is formulated as follows. Applying the signal model (9) to (14) gives the real-valued one-bit model in (15)-(16), where q = (q_1, . . ., q_M), v ∼ N(0, σ²I) and σ² = σ²_C/2. The OMOD signal model in (15)-(16) is seen to take the form of the basic QLR model (1) in Section 2, and hence we can apply the results in Section 2 to OMOD. Specifically, the ML detection of the symbol vector θ from the one-bit MIMO-OFDM signal y, given knowledge of H and σ², can be formulated as in (17), with f(θ) = -∑_{i=1}^m log Φ(b_i⊤θ), where b_i = y_i a_i/σ and a_i is the ith row of A; or, in the compact form shown in (18). There is an important aspect to note. The subcarrier decoupling in Section 3.2, which allows us to decouple the massive-scale unquantized MIMO-OFDM problem into many smaller problems, does not work for OMOD. This is because f in (17) does not possess the structure required to apply subcarrier decoupling. This means that we will have to confront the massive-scale nature of MIMO-OFDM in the one-bit case.

Problem Statement
Our study will revolve around the following formulation: min_{θ∈R^n} f(θ) + h(θ), (19), where h : R^n → R ∪ {+∞} is a penalty function that takes a component-wise separable form. The above formulation is an approximation of the ML OMOD problem (17). Consider the following two examples.

2) Box [18]
Consider h(θ) = I_{[-U,U]^n}(θ), where I_X(x) = 0 if x ∈ X, and I_X(x) = ∞ if x ∉ X. This corresponds to a relaxation of the ML OMOD detector (17) in which the constellation set D is replaced with the interval [-U, U].
We are interested in studying how problem (19), which has a massive scale, can be handled in an efficient fashion. The next two subsections describe two existing methods for (19).

Existing Method: Proximal Gradient
One way to handle problem (19) is to apply the proximal gradient (PG) method [23]. The PG method for (19) is given by θ^{k+1} = prox_{η_k h}(θ^k - η_k ∇f(θ^k)), where η_k > 0 is a step size; ∇f(θ) is the gradient of f at θ, and it can be shown that ∇f(θ) = -B⊤ψ(Bθ), for which ψ(z) = (ψ(z_1), . . ., ψ(z_m)) and ψ(z) = φ(z)/Φ(z); and prox_h is the proximal operator associated with h. We typically consider cases for which prox_h(x) is easy to compute. For example, for the box case h = I_{[-U,U]^n}, prox_h(x) is the element-wise clipping of x to [-U, U]. We want to examine the complexity of the PG method. The most expensive part lies in computing the gradients ∇f(θ^k). At first sight, computing ∇f(θ) in (24) seems expensive since B is very large. But one can use the OFDM structure to reduce the complexity. It was shown in [18] that ∇f(θ) can be computed with the complexity in (26), where C denotes the complexity of numerically computing Φ to a given precision. For the reader's convenience, we describe the details of the computation of ∇f(θ) in the supplemental material. Simply speaking, to compute ∇f(θ) efficiently, we need to run fast Fourier transforms (FFTs) or inverse FFTs (IFFTs) for a total of 2M times. To sum up, the per-iteration complexity of the PG method is (26).
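As a concrete sketch of one PG iteration for the box relaxation: the code below assumes f(θ) = -∑_i log Φ(b_i⊤θ) (the ML-relaxation form discussed above), with ψ(z) = φ(z)/Φ(z), and uses the fact that the prox of the box indicator is entrywise clipping; B, U and the step size are placeholders, and the dense gradient here ignores the FFT-based speedup.

```python
import numpy as np
from scipy.stats import norm

def f_obj(theta, B):
    """f(theta) = -sum_i log Phi(b_i^T theta); the assumed ML-relaxation objective."""
    return -float(np.sum(norm.logcdf(B @ theta)))

def grad_f(theta, B):
    z = B @ theta
    return -B.T @ (norm.pdf(z) / norm.cdf(z))   # psi(z) = phi(z)/Phi(z), element-wise

def pg_step(theta, B, U, eta):
    # gradient step followed by the prox of the box indicator (= clipping to [-U, U])
    return np.clip(theta - eta * grad_f(theta, B), -U, U)
```

With the step size η = 1/σ_max(B)² suggested by the analysis below, iterating `pg_step` decreases the objective monotonically while keeping the iterates inside the box.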
The PG method for OMOD was studied in [18], wherein the efficient FFT/IFFT-based method for computing the gradients was proposed, and the box relaxation was considered.

Existing Method: Expectation Maximization
Another way to handle problem (19) is to apply the EM method. By following the EM method reviewed in Section 2, we can show the following. In accordance with the E-step (5) of the EM method in Section 2, the EM surrogate function of the objective function f of the OMOD problem (19) is given by (27), where x′ = E_x[x|y, θ′] is a conditional mean of the unquantized signal x in (15)-(16) and, similarly, r′ = E_r[r|q, s′] is a conditional mean of the unquantized MIMO-OFDM signal r in (9). The EM method is given by the minimization s^{k+1} ∈ arg min in (28), where, for convenience, we denote h(s) = h(ℜ(s)) + h(ℑ(s)). The merit of the EM method is that it allows us to apply the subcarrier decoupling in Section 3.2. To describe this, let r̂^k_m = F r^k_m as in (29). By the subcarrier decoupling concept in Section 3.2, we can decompose (28) into the per-subcarrier problems (30), for w = 1, . . ., W. The basic description of the EM method is complete. Intuitively, the EM method seems more promising than the PG method in the previous subsection. The EM method has no parameters, such as a step size, to select; it inherits the merits of subcarrier decoupling, which seems to suggest better efficiency; and it performs more optimization in one iteration, specifically in (30), which seems to suggest faster convergence.
The story is actually more complex, as one tries to unravel it. We should not forget that the conditional means r^k_m need to be computed. Examining the equations closely (see (6), (8), (9) and (15)), one will find that r^k_m is given by (31) for all m, n, where ⊙ denotes the element-wise product. It can be verified that the complexity of (29) and (31) is as shown in (32); in particular, we need FFTs and IFFTs for a total of 2M times. The complexity in (32) is the per-iteration complexity of EM, without counting the complexity required to solve (30). We see that (32) is not much better than the per-iteration complexity (26) of the PG method. When the complexity of solving (30) is factored in, the per-iteration complexity of the EM method is likely to be higher than that of PG. The next question is then whether EM leads to faster convergence. We will try to answer this question in the subsequent sections.
The EM method for OMOD was studied in [10], wherein the efficient subcarrier-decoupling implementation was proposed, and GMAP was considered.For GMAP, the problems in (30) have closed-form solutions, and this advantage was exploited.

Convergence Analysis and New Insights
In the last section, we asked whether the EM method converges faster than the PG method in the application of OMOD. In this section we perform convergence analysis for EM. We will show that, in theory, the EM method should converge faster than the PG method. However, this is not the most striking result. The insight gained from our analysis will lead us to develop an accelerated variant of the EM method, which converges even faster. It will also enable us to consider inexact variants of the EM method, which provide flexibility in implementation. We will also expand our analysis to include the non-convex case, i.e., instances for which the penalty function h is non-convex.
This section is organized as follows. Section 5.1 gives the problem setup. Sections 5.2 and 5.3 describe the convergence results for the PG and EM methods, respectively. Section 5.4 reveals the insight behind our EM convergence analysis. Section 5.5 describes a result indicating that the PG and EM methods can converge slowly at high SNRs. Taking insight from the EM convergence analysis, Sections 5.6 and 5.7 develop the accelerated and inexact EM methods, respectively. Section 5.8 considers the convergence analysis for the non-convex case. It is worth noting that, while we develop these results for OMOD, they can be applied to other QLR problems.

Problem Setup and Preliminary Concepts
Let us write down our main problem, problem (19), again: min_{θ∈R^n} F(θ) := f(θ) + h(θ), (33), where h is proper, lower semi-continuous and convex. We assume that problem (33) has an optimal solution, and we will use the notation θ⋆ to denote an optimal solution to problem (33). In our study, we will treat problem (33) as a problem with a general B and h. Hence, we basically consider a class of QLR problems, not just OMOD.
We also recall some basic formulas. The gradient of f can be written as ∇f(θ) = -B⊤ψ(Bθ), where ψ(z) = (ψ(z_1), . . ., ψ(z_m)) and ψ(z) = φ(z)/Φ(z). The EM surrogate function of f, denoted g(θ|θ′), is given by (36); see (5)-(6). We have g(θ|θ′) ≥ f(θ) for all θ, θ′, and g(θ′|θ′) = f(θ′) for all θ′. Some concepts and notations should be introduced. Let ϕ : R^n → R ∪ {+∞}, and let X ⊆ R^n. The domain of ϕ is defined as dom ϕ = {θ ∈ R^n : ϕ(θ) < +∞}. Consider a minimization problem min_{θ∈R^n} ϕ(θ). A point θ̄ ∈ R^n is said to be a critical point of the problem if 0 ∈ ∂ϕ(θ̄), where ∂ϕ(θ) is the limiting subdifferential of ϕ at θ; see [41] and the references therein. A point θ̄ is said to be an ε-critical point of the problem if dist(0, ∂ϕ(θ̄)) ≤ ε, where dist(y, X) = min_{x∈X} ∥y - x∥ denotes the distance between y and X. If ϕ is convex, then any critical point of the problem is an optimal solution to the problem. The notations σ_max(X) and σ_min(X) denote the largest and smallest singular values of X, respectively; ⟨·, ·⟩ denotes the inner product.

Proximal Gradient
Consider the PG method for problem (33): θ^{k+1} = prox_{ηh}(θ^k - η∇f(θ^k)), (38), where we assume a constant step size η > 0. The following result is well known.
Fact 1 (see, e.g., [23]) Consider problem (33) and the accompanying problem setup, and consider the PG method (38). Suppose that f has an L_f-Lipschitz continuous gradient on dom F. If the step size is chosen as η = 1/L_f, then F(θ^k) - F(θ⋆) ≤ L_f ∥θ^0 - θ⋆∥²/(2k). Our question is whether f possesses the desired Lipschitz continuous gradient property. This appears not to have been answered before, and we show that the answer is yes.
Consequently, the PG method (38) with step size η = 1/σ_max(B)² has the convergence result (39) for problem (33), with L_f = σ_max(B)². The proof of Proposition 1 is given in Appendix A. Proposition 1 has several implications. First, it confirms that the PG method guarantees convergence to the optimal solution with a convergence rate of at least 1/k. Second, it indicates how the convergence scales with the problem instance B; specifically, it scales with σ_max(B)². This will be useful in our comparison with the EM method later. Third, it suggests how we can select the step size, specifically, by η = 1/σ_max(B)². In the OMOD application, we can show from (18) the bound (40) on σ_max(B)², which gives us a simple way to select the step size. Note that the prior OMOD work [18] selects the step size manually.

Expectation Maximization
Next, consider the EM method for problem (33): θ^{k+1} ∈ arg min_θ g(θ|θ^k) + h(θ). (41) We have the following convergence result.
Proposition 2 Consider problem (33) and the accompanying problem setup, and consider the EM method (41). Then the bound (42) holds. The proof of Proposition 2 is given in Appendix B.1, and we will describe the insight behind Proposition 2 in the next subsection. Eq. (42) shows that the EM method has a convergence rate of at least 1/k, the same as PG. It is interesting to compare the convergence results of PG and EM. Since, as noted in (43), ∥B(θ^0 - θ⋆)∥² ≤ σ_max(B)²∥θ^0 - θ⋆∥², the EM convergence result (42) is seen to be at least no worse than the PG result in (39). In fact, (43) suggests that the EM method should converge faster than the PG method, particularly when the gap in (43) is large.

Insight Behind the Convergence Results, and Prior Work
It is interesting to see the insight behind the EM convergence result in Proposition 2. To see it, we need to understand how PG convergence results were established in the literature. The idea is to see PG as a majorization-minimization (MM) method. Specifically, consider the MM method θ^{k+1} ∈ arg min_θ u(θ|θ^k) + h(θ), where u(·|θ′) is a majorant of f at θ′; i.e., u(θ|θ′) ≥ f(θ) for all θ, and u(θ′|θ′) = f(θ′). The PG method (38) with step size η = 1/L_f is seen as an MM with the majorant u(θ|θ′) = f(θ′) + ⟨∇f(θ′), θ - θ′⟩ + (L_f/2)∥θ - θ′∥². (45) Many PG convergence proofs make strong use of the identity (45). As an important discovery, we found that the EM surrogate function g in (36) possesses the following structure: g(θ|θ′) = f(θ′) + ⟨∇f(θ′), θ - θ′⟩ + (1/2)∥B(θ - θ′)∥². (46) We see that (46) resembles (45). This has tremendous insight. It suggests that we may use the PG convergence proof concepts in the literature to show convergence for EM. In fact, the proof of Proposition 2 is conceptually the same as that of Fact 1.
The prior works on EM convergence analyses should be mentioned and compared. The question is whether the available convergence analysis results [32-35] can lead to the same or better claims than our result in (42). This aspect is complex, and we provide a detailed account of it in the supplemental material. In summary, the best result we can find is (47), which holds for some constant R > 0; it comes from Mairal's analysis [34]. Our result in (42) appears to be better than (47); e.g., it is unconvincing to use (47) to argue that EM should converge faster than PG.

A Slow Convergence Phenomenon
The PG and EM theoretical bounds in (39) and (42) are seen to be loose if B has a large magnitude. From (18), the magnitude of B is seen to increase with the SNR; see also (40). This gives the impression that, under high SNRs, the PG and EM methods could converge slowly. Further analysis shows that this is true for the choices of h of interest.
Proposition 3 Consider problem (33) and the accompanying problem setup. Consider either the PG method (38) with η = 1/σ_max(B)², or the EM method (41). Suppose that θ^k is ε-optimal in the sense that F(θ^k) - F(θ⋆) ≤ ε for some optimal solution θ⋆ to problem (33) and for ε > 0.
(a) Suppose that h(θ) = I_X(θ) for some non-empty convex closed set X. Then k must satisfy one lower bound for the PG method and another for the EM method, in which σ⁺_min(B) denotes the smallest positive singular value of B.
The proof of Proposition 3 is given in Appendix C. Proposition 3 shows that the number of iterations required by the PG or EM method must increase with the magnitude of B, or the SNR, assuming that the other variables are held fixed. Let us show the insight behind Proposition 3. As an example, consider h(θ) = 0 and the PG method (38) with step size η = 1/σ_max(B)². From (34) one can show the gradient bound (50), where the second inequality is due to 0 < ψ(z) < 1 (see (35)). Applying (50) to the PG method (38) gives (51). We see that, if the magnitude of B is large, then every θ^k changes little, and we have slow convergence. This is the idea that underlies the results in Proposition 3.
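The shrinking-step effect can be seen in a small experiment: scale B up (a proxy for raising the SNR), and the first PG step from θ = 0 with η = 1/σ_max(B)² shrinks proportionally, since the gradient at θ = 0 only grows linearly in the scale while the step size shrinks quadratically. The sizes and the ψ = φ/Φ gradient form are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
B0 = rng.standard_normal((100, 5))

def first_pg_step_norm(scale):
    """Norm of the first PG step from theta = 0 when B = scale * B0."""
    B = scale * B0
    eta = 1.0 / np.linalg.norm(B, 2) ** 2        # eta = 1/sigma_max(B)^2
    z = B @ np.zeros(B.shape[1])
    grad = -B.T @ (norm.pdf(z) / norm.cdf(z))    # gradient of f at theta = 0
    return float(np.linalg.norm(eta * grad))
```

Scaling B by 10 scales the first step down by exactly a factor of 10 here, which matches the intuition that the iterates barely move when the magnitude of B is large.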

Accelerated EM
The PG-EM relationship revealed in Section 5.4 opens new possibilities. In first-order convex optimization, it is well known that the convergence of the PG method can be accelerated by applying extrapolation [20, 21]. We can accelerate the EM method in the same manner. Consider the following modified EM method for k ≥ 0: form the extrapolated point ϑ^k = θ^k + α_k(θ^k - θ^{k-1}), and obtain θ^{k+1} by minimizing g(θ|ϑ^k) + h(θ), where ϑ^0 = θ^0 and {α_k}_{k≥1} is an extrapolation coefficient sequence. We employ the FISTA sequence t_{k+1} = (1 + √(1 + 4t_k²))/2, α_k = (t_{k-1} - 1)/t_k, where t_0 = 1. The modified EM method in (52)-(55) will be called the accelerated EM method in what follows.
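The extrapolation coefficients are straightforward to generate; the sketch below uses the standard FISTA t-recursion, which we take to correspond to the sequence described above.

```python
import numpy as np

def fista_alphas(K):
    """Standard FISTA extrapolation coefficients:
       t_0 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2, alpha_k = (t_{k-1} - 1) / t_k."""
    t = 1.0
    alphas = []
    for _ in range(K):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        alphas.append((t - 1.0) / t_next)
        t = t_next
    return alphas
```

The extrapolated point is then ϑ^k = θ^k + α_k(θ^k - θ^{k-1}); the coefficients start at 0 (so the first iteration is a plain EM step) and increase monotonically toward 1.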
The accelerated EM method is the direct application of the extrapolation in accelerated PG to the standard EM method (41). Note that, if we replace g(θ|ϑ^k) in (52) with u(θ|ϑ^k) in (45), we return to the accelerated PG method. We have the following result.
Proposition 4 Consider problem (33) and the accompanying problem setup, and consider the accelerated EM method (52)-(55). Then the bound (56) holds. The proof of Proposition 4 is given in Appendix B.3. Proposition 4 indicates that accelerated EM has an accelerated convergence rate of at least 1/k², which is the same as that of the accelerated PG method.

Inexact EM
The EM method (41) requires us to solve the problems in (41) exactly. When the exact solutions are computationally expensive to obtain, we may want to accept economical, but inexact, solutions. This motivates the study of inexact EM (57): for k ≥ 0, θ^{k+1} is an approximate solution to the problem in (41), produced by a solver that may not give high solution precision due to its limited complexity. The question is how robust the EM method is under solution errors. Intriguingly, the same kind of question has been addressed in the context of inexact PG [22]. We take insight from that context to answer the question at hand. We characterize the inexactness of θ^{k+1} by dist(0, ∂G(θ^{k+1}|θ^k)) ≤ ε_{k+1}, (58), for some ε_{k+1} ≥ 0; ε_{k+1} = 0 means that θ^{k+1} is an optimal solution to the problem in (57). Also, we assume the descent condition G(θ^{k+1}|θ^k) ≤ G(θ^k|θ^k). (59) Eq. (59) makes sense in the following way: if we warm-start the inexact solver for problem (57) by using θ^k as the initialization, the resulting approximate solution θ^{k+1} should yield an objective value G(θ^{k+1}|θ^k) no worse than the initial objective value G(θ^k|θ^k). We have the result below.
Proposition 5 Consider problem (33) and the accompanying problem setup. Consider the inexact EM method (57), with (58)-(59) being satisfied. Suppose that B has full column rank. For k ≥ 1, the bound (60) holds. The proof of Proposition 5 is given in Appendix B.2. Proposition 5 reveals two aspects. First, the inexact EM method can lead to convergence, with a rate of at least 1/k, if the solution accuracy ε_k improves with k in such a way that ∑_{i=1}^∞ ε_i < ∞; e.g., ε_k decreases at a rate of k^{-p} for p > 1, or faster. Second, the inexact EM method may be more sensitive to solution errors if the smallest singular value σ_min(B) of B is smaller, i.e., if B is more ill-conditioned. In massive MIMO, we should expect B to be reasonably well-conditioned.
We can also consider the accelerated version of inexact EM.
Proposition 6 Consider problem (33) and the accompanying problem setup. Consider the accelerated version of the inexact EM method (57), wherein θ^k is replaced by ϑ^k in (53)-(55). Suppose that (58) holds and that B has full column rank. For k ≥ 1, an accelerated analogue of the bound in Proposition 5 holds. The proof of Proposition 6 is given in Appendix B.4. Note that the above result does not require the descent condition (59). Proposition 6 looks similar to its non-accelerated counterpart in Proposition 5. The notable difference is that, to achieve a convergence rate of at least 1/k² in Proposition 6, we need the solution accuracies ε_k to satisfy ∑_{i=1}^∞ i ε_i < ∞; i.e., ε_k decreases at a rate of k^{-p}, p > 2, or faster. This means that the accelerated inexact EM method may require higher solution accuracies than its non-accelerated counterpart.

The Case of Non-Convex h
Our preceding analyses assume convex h. We also want to handle non-convex h. Consider the same inexact EM scenario as in (57). The new difficulty is that the problems in (57) are non-convex. We assume that the inexact solver for the problems in (57) can only guarantee finding an ε-critical point, rather than an optimal or ε-optimal solution. The ε-criticality of θ^{k+1} is characterized by (58). Also, we assume that the descent condition (59) holds; our justification is the same as before. Furthermore, we modify G(θ|θ^k) in (57) by adding a proximal term (τ/2)∥θ - θ^k∥², for some τ > 0. We have the following result.
Proposition 7 Consider the above non-convex inexact EM scenario. Suppose that B has full column rank. Then the bound (61) holds.
The proof of Proposition 7 is given in Appendix B.5. Eq. (61) employs an ellipsoidal extension of the ε-critical point definition to pin down the convergence. As an easier way to understand it, consider the expression (62), which can be shown to be an implication of (61). Eq. (62) suggests that the EM method leads to a δ-critical point.

Accelerated EM Algorithms for OMOD

In this section we use the new theoretical insights gained in the previous section to build efficient algorithms for OMOD. Our attention is paid to the implementation of the accelerated inexact EM (AIEM) method.

An AIEM Algorithm for a General Convex h
Consider the OMOD problem (19) with a general convex h. By modifying the EM OMOD method in Section 4.4 into the AIEM method in Section 5.7, we have the following AIEM OMOD method. Let s^0 be the initialization, and let s^0_ex = s^0. We perform, for k = 0, 1, 2, . . .:
1. E-step: for all m, compute the conditional mean estimate r^k_m = E_{r_m}[r_m|q, s^k_ex] in the same way as (31).
2. M-step preparation: for all m, compute r̂^k_m = F r^k_m via FFT. Let ř^k_w = (r̂^k_{1,w}, . . ., r̂^k_{M,w}) for all w.
Note that if we solve the problems in (63) exactly, the above method becomes the accelerated (exact) EM; if we set α_k = 0 for all k, the above method becomes the regular non-accelerated EM; and (64) is identical to the solution quality requirement in (58). Proposition 6 suggests that we can choose ε_k = C k^{-p} for some C > 0 and p > 2.
We have not specified the inexact solver for the M-step problems in (63). The AIEM method provides us with the freedom to choose the solver, and in this study we employ the accelerated PG (APG) method. Fixing k and w, the APG method for the problem in (63) is given, for j ≥ 0, by the standard APG update with step size η_w = 1/σ_max(Ȟ_w)^2 and gradient ∇ϕ_w^k(s) = Ȟ_w^H Ȟ_w s − Ȟ_w^H ř_w^k. We stop the APG loop when {s_w^{j+1}}_{w=1}^W meets the solution quality requirement (64). Assembling the above components together, we obtain the AIEM algorithm in Algorithm 1.
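As a sketch of this inner solver, the numpy function below runs a FISTA-style APG iteration on a generic quadratic fit plus a proximable penalty. Here `H` and `r` stand in for the per-subcarrier quantities Ȟ_w and ř_w^k, `prox` is the proximal operator of η·h, and the stopping test is a simple relative-change rule rather than the paper's requirement (64) — all of these substitutions are ours.

```python
import numpy as np

def apg_quadratic(H, r, prox, max_iter=200, tol=1e-8):
    """FISTA-style APG sketch for min_s 0.5*||H s - r||^2 + h(s).

    `prox(z, eta)` is the proximal operator of eta*h.  The step size
    is 1/sigma_max(H)^2, as in the text.
    """
    eta = 1.0 / np.linalg.norm(H, 2) ** 2       # step size 1/sigma_max(H)^2
    s = np.zeros(H.shape[1], dtype=r.dtype)
    s_ex = s.copy()                              # extrapolated point
    t = 1.0
    for _ in range(max_iter):
        grad = H.conj().T @ (H @ s_ex - r)       # gradient of the quadratic
        s_new = prox(s_ex - eta * grad, eta)     # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        s_ex = s_new + ((t - 1.0) / t_new) * (s_new - s)  # extrapolation
        if np.linalg.norm(s_new - s) <= tol * max(np.linalg.norm(s), 1.0):
            s = s_new
            break
        s, t = s_new, t_new
    return s
```

With `prox` set to the identity (i.e. h = 0) and H well conditioned, the iterate converges to the least-squares solution.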
Algorithm 2: DIEM, inspired by the AIEM in Algorithm 1.

A Deep Inexact EM Algorithm for OMOD
In this section we apply deep unfolding [24,28] to the inexact EM algorithm to build a data-driven detector for OMOD. The idea of deep unfolding is to alter an existing iterative algorithm by data-driven learning, thereby hoping to learn a better algorithm. Specifically, we view each iteration of the predecessor algorithm as a network layer, untie some of the parameters or alter some of the structures, and learn the untied parameters or new structures from data. We tried various ways to deep-unfold Algorithm 1. The successful one, by our empirical tests, is shown in Algorithm 2; we call this algorithm DIEM. The untied parameters are marked in red, and some notable changes are marked in blue.
The most notable alteration in DIEM is Line 9 of Algorithm 2. We replace the proximal operator prox_{η_w σ^2 h} (Line 12 of AIEM) with a nonlinear activation function Ω_γ. We use a multilevel sigmoid function; its real-valued scalar version is given in (69), where γ > 0 and ϱ : R → [−1, 1] is a 0-centered sigmoid function. For the complex-valued vector case we apply Ω_γ to the real and imaginary parts separately, i.e., Ω_γ(s) = [Ω_γ(ℜ(s_i)) + j Ω_γ(ℑ(s_i))]_i. Sigmoid functions were also used as activation functions in deep MIMO detection [25]. The intuitive idea is that Ω_γ approaches the decision function of the constellation set as γ → ∞, and we want to use it to encourage the symbol estimates to lie closer to the constellation points.
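Since (69) is not preserved here, the following numpy sketch shows one plausible form of a multilevel sigmoid: a sum of shifted tanh functions (tanh being a 0-centered sigmoid onto [−1, 1]) with decision boundaries at the midpoints of the 16-QAM per-dimension levels {−3, −1, 1, 3}. The tanh choice and the boundary placement are assumptions, not the paper's exact (69).

```python
import numpy as np

def multilevel_sigmoid(s, gamma, boundaries=(-2.0, 0.0, 2.0)):
    """A plausible scalar form of Omega_gamma: a sum of shifted
    0-centered sigmoids (here rho = tanh).  With boundaries at the
    midpoints of the levels {-3, -1, 1, 3}, the sum approaches the
    nearest-level decision function as gamma -> inf."""
    s = np.asarray(s, dtype=float)
    return sum(np.tanh(gamma * (s - b)) for b in boundaries)

def complex_activation(s, gamma):
    """Apply Omega_gamma separately to real and imaginary parts,
    matching the elementwise complex extension in the text."""
    s = np.asarray(s)
    return multilevel_sigmoid(s.real, gamma) + 1j * multilevel_sigmoid(s.imag, gamma)
```

For large γ, an input of 0.9 (closest level: 1) maps to approximately 1, and an input of 2.7 (closest level: 3) maps to approximately 3, illustrating the soft-decision behavior.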
The other alterations in DIEM are as follows: we abandon acceleration in the EM loop; we use a one-step APG as the inexact solver; we untie the step size η_k and extrapolation coefficient α_k of the APG in a layer-varying fashion; and we add a parameter β_k to partially alter the noise standard deviation.
The training of DIEM is standard. Let p = {s, H, σ, q} be a problem instance, let π = {α, β, γ, η} be the set of network parameters, and let o_π(p) be the network output. We generate a large number of problem instances p_1, ..., p_T according to the signal model. Then we learn π from p_1, ..., p_T using deep learning software; the objective is the empirical training loss, where s_t is the symbol vector of the problem instance p_t.
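The training objective equation itself is not preserved in this extraction. Assuming the common deep-unfolding choice of a squared-error loss between the network outputs o_π(p_t) and the true symbol vectors s_t, a minimal stand-in looks like this; the function name and loss form are our assumptions.

```python
import numpy as np

def training_loss(outputs, symbols):
    """Assumed empirical training objective:
        sum_t || o_pi(p_t) - s_t ||^2
    over the generated problem instances.  The paper's exact loss is
    stated in an equation not preserved here; this is a plausible
    stand-in, not the paper's definition."""
    return sum(np.linalg.norm(o - s) ** 2 for o, s in zip(outputs, symbols))
```

In practice this objective would be minimized over π with stochastic gradient descent, as described in the implementation details below.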
We want to provide an interpretation for DIEM. Suppose we choose a non-convex penalty h in the OMOD problem (19) to better approximate the difficult constellation constraints. In a concurrent study, we recently showed that there exists a constellation-promoting penalty function h whose proximal operator may be approximated by the sigmoid activation function (69); see [42] for details. As studied in Section 5.8, the corresponding non-convex OMOD problem (19) can be handled by inexact EM. Following Section 6, we can show that the non-convex inexact EM can be implemented by nearly the same procedure as the non-accelerated version of Algorithm 1. From this view, DIEM can be regarded as the deep unfolding of the non-convex inexact EM.

3. BOX-PG: the PG method for box OMOD, proposed previously in [18] and reviewed in Section 4.3;
4. BOX-AIEM: the accelerated inexact EM method for box OMOD, built in Section 6;
5. BOX-EM: the standard EM method for box OMOD, implemented by BOX-AIEM with a high solution accuracy;
6. DIEM: the deep algorithm in Section 7;
7. ZF: direct application of the zero-forcing detector for unquantized MIMO-OFDM to OMOD.
We tested all the algorithms in MATLAB 8.5 on the same desktop with an Intel i7-7700 processor and 16 GB of RAM.
The parameter settings for the PG and EM algorithms (the non-deep ones) are as follows. The stopping condition is either ‖s^{k+1} − s^k‖/‖s^k‖ ≤ 5 × 10^{−4} or k > 1000. For BOX-AIEM, we set the M-step solution accuracies as ε_k = 2NW k^{−2.1}. For BOX-EM, we set ε_k = 2N × 10^{−4}. For BOX-PG, we choose the step size η as the reciprocal of (40); see Proposition 1. We set the noise power parameter σ^2 by the loading rule in (70), where σ^2_actual is the actual noise power and σ_0 > 0 is some given constant. Doing so avoids slow convergence: as shown in Proposition 3, the PG and EM methods can converge slowly for small σ^2. In our simulations, we set σ_0 = 3.
The training of DIEM is implemented on the PyTorch 1.2 platform for Python 3.5.2 (the testing of DIEM is done in MATLAB). The training optimizer is stochastic gradient descent with step size 10^{−3}. The SNR range in our training is the SNR range in our simulations, with an additional margin of 2 to 5 dB. The number of layers is K = 20. The open-source code is available at https://github.com/shaomingjie/DeepEM_MIMO_det.git.

Comparison of EM and Accelerated Exact/Inexact EM
In this subsection we compare the standard EM method and our modified EM methods. Fig. 1 shows the bit error rate (BER) performance of the GMAP- and box-based algorithms. GMAP-AEM is seen to yield nearly the same performance as GMAP-EM. This is expected, since both solve the same problem and the problem is convex. Similarly, we see the same behavior with the box-based algorithms. Fig. 2 shows the average numbers of EM or PG iterations used by the algorithms. As can be seen, the accelerated EM methods use far fewer iterations than the PG and standard EM methods. This demonstrates the benefit of accelerated EM.
From Fig. 2, we also observe that the numbers of iterations of the PG and EM algorithms rise with the SNR. This agrees with the theoretical results in Proposition 3. Fig. 3 shows the average runtimes. The box case is the more interesting one to examine. While we see in Fig. 2(b) that BOX-EM uses fewer iterations than BOX-PG, we see in Fig. 3(b) that BOX-EM runs slower than BOX-PG. This is because the M-step in EM requires non-negligible computations. Moreover, BOX-AIEM is seen to be faster than BOX-EM (and also BOX-PG); the use of the inexact M-step is a key factor in its runtime advantage.

Performance Comparison for DIEM
In this subsection we demonstrate the performance of DIEM. Figs. 4-5 show the BER performance of DIEM under various system settings. We benchmark DIEM against ZF, GMAP-AEM and BOX-AIEM. DIEM is seen to exhibit remarkably better BER performance than the other algorithms. Tables 1-2 show the runtimes of the various algorithms. As can be seen, DIEM is much faster than the other algorithms. The reason for the efficiency of DIEM is that it uses a fixed number of layers, specifically 20, and each layer consists of simple operations, specifically one-step APG-like operations.

Conclusion
To conclude, we analyzed the EM convergence rate for a class of approximate ML formulations for OMOD. The analysis revealed the possibility of slow convergence under high SNRs. It also led to accelerated and inexact EM schemes. The implementation of these schemes for OMOD was studied, and the use of deep unfolding in EM for OMOD was also considered. The resulting EM algorithms were numerically shown to have competitive performance in the OMOD application. In deep learning-based methods, empirical study plays a key role; as a future direction, it would be useful to thoroughly examine, by numerical experiments, the reliability and efficiency of the proposed deep unfolded EM method under different conditions.

Proof of Lemma 1:
Moreover, ψ(z) is closely related to the Mills ratio, and it was shown in that context that dψ(z)/dz ≤ 1 for all z [44]. The desired result follows.
From the expression of ∇f(θ) in (34), we have the chain of inequalities above, where Lemma 1 is applied to obtain the second inequality. The proof is complete.

B Proof of Propositions 2 and 4-7
The proofs of Propositions 2, 4, 5, 6 and 7 follow the same principles as their PG counterparts [22,23], although a careful verification remains necessary. The following lemma, which accounts for the structure of our problem, is important.

B.1 Proof of Proposition 2
We start with the basic EM method. Applying Lemma 2 to the EM method (41), with θ = θ^⋆ and ε_k = 0, gives the first bound above. Also, by the basic EM property F(θ^{k+1}) ≤ F(θ^k), we have the second bound, and the result follows.

B.2 Proof of Proposition 5
Next, we consider the inexact EM scenario in (57)-(59). For convenience, let w_k be defined as in the display above. Following the same proof as in the preceding subsection, but with ε_k ≠ 0 in general, one can show that (79) and (80) hold for k ≥ 1. Note that (80) is due to the descent condition F(θ^{k+1}) ≤ F(θ^k), enabled by condition (59). The new challenge lies in the last term of (79). To deal with it, assume that B has full column rank. We then have (81), where we have used the result ‖Bx‖ ≥ σ_min(B)‖x‖. Since w_k ≥ 0, Eqs. (79) and (81) imply (82), where we denote σ_min = σ_min(B) for convenience. Consider the following lemma.
Lemma 3 [22, Lemma 1] Let {u_k}_{k≥0} be a non-negative sequence, and suppose that the recursion stated in the lemma holds for k ≥ 1, where λ_i ≥ 0 for all i. Then, for k ≥ 1, the stated bound holds. Applying Lemma 3 to (82) gives (83) for k ≥ 1, where A_k is defined in (84). Now we use (81) and (83) to bound the last term of (79). From (79), we have (85), where the second inequality is obtained by applying (81) and (83), and the third inequality is due to A_k ≥ A_i for i ≤ k. Applying (85) and (84) to (80) leads to the final result in Proposition 5.

B.3 Proof of Proposition 4
To describe the proof for the accelerated EM method, we first write down some basic results [21]. The sequence {t_k}_{k≥0} in (55) satisfies (a) t_{k+1}^2 − t_{k+1} = t_k^2; (b) t_k ≥ (k + 2)/2; and (c) t_k ≤ k + 1. Also, the following result, Lemma 4, is important; note that Lemma 4 uses only the convexity of F and the property t_{k+1}^2 − t_{k+1} = t_k^2. Applying Lemma 2 to the accelerated EM method (52)-(55), with θ given by the one in Lemma 4, θ^k = ϑ^k and ε_k = 0, we have (86)-(87). It can be verified that (88) holds for k ≥ 1. Combining (86)-(88) gives (89), which further implies the subsequent inequality. Applying Lemma 2 to the accelerated EM method (52), with k = 1, θ = θ^⋆ and ε_k = 0, and noting ϑ^0 = θ^0, we obtain a bound on ‖u_1‖^2. By noting that t_0 = 1 and u_1 = θ^1 − θ^⋆, we further obtain a bound which, applied to (89), gives the penultimate inequality. Finally, by applying t_k ≥ (k + 2)/2 to this inequality, we obtain the final result in Proposition 4.

B.4 Proof of Proposition 6
The proof for the accelerated inexact EM scenario combines the proof methods of the last two subsections. Following the same proof as in the previous subsection, but with ε_k ≠ 0 in general, one can show the first two inequalities above for k ≥ 1. These two inequalities imply that, for k ≥ 0, the recursion (90) holds, with the constants defined in the accompanying display. Following the same proof as in Appendix B.2 (specifically, by seeing (82) as (90)), one can show the stated bound, where A_{k+1} = (Σ_{i=1}^{k+1} t_{i−1} ε_i)/(√2 σ_min). Finally, by applying t_k ≥ (k + 2)/2 and t_k ≤ k + 1 to the left-hand and right-hand sides of this bound, respectively, we obtain the final result in Proposition 6.

B.5 Proof of Proposition 7
The proof for the inexact EM scenario (57)-(60) in the case of non-convex h is as follows. First, we show that the bound in (91) on min_{i=0,...,k−1} of the stated quantity holds for k ≥ 1 and for any δ_1, ..., δ_k ≥ 0, with the constants defined in the accompanying display. From the expression of G(θ|θ′) in (60), we have (93), where the first inequality is due to f(θ) ≤ g(θ|θ′), and the second inequality is due to the descent condition (59). Summing (93) over i = 0, ..., k−1 yields the next inequality, and applying Σ_{i=0}^{k−1} u_i ≥ k min_{i=0,...,k−1} u_i to its left-hand side leads to the result in (91).
Second, in a similar way as (93), we obtain the bound above, where we have used the fact that ‖x‖ ≥ ‖V_1^⊤ x‖ for any semi-orthogonal V_1. For h(θ) = I_X(θ), with X being non-empty, closed and convex, it can be verified that Prox_{h,B}(z) = arg min_{θ∈X} ‖B(z − θ)‖^2, and then that θ^i ∈ Prox_{h,B}(θ^i). Consider the following extension of Lemma 5.
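To make the Prox_{h,B} statement concrete, the following numpy sketch covers the illustrative special case B = I with X a box, where arg min_{θ∈X} ‖z − θ‖^2 reduces to elementwise clipping; for a general B the minimizer need not be a simple clip. The function name and bounds are our own.

```python
import numpy as np

def prox_box_indicator(z, lo=-1.0, hi=1.0):
    """Prox of the indicator of the box X = [lo, hi]^n with B = I:
    the Euclidean projection onto X, i.e. elementwise clipping.
    A point already inside X is mapped to itself, which is the
    theta^i in Prox(theta^i) property used in the proof."""
    return np.clip(np.asarray(z, dtype=float), lo, hi)
```

Note the fixed-point property: any feasible point is unchanged by the projection, consistent with θ^i ∈ Prox_{h,B}(θ^i) above.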

Lemma 6 Let θ_1 and θ_2 be the solutions of the problems in (103).
Proof of Lemma 6: The proof is similar to that of Lemma 5; see, e.g., [23]. By the optimality conditions of the problems in (103), θ_1 and θ_2 must satisfy (104). Since h is convex, we have the property in (105); see, e.g., [38]. Applying (105) to (104), we obtain the chain of inequalities above, where the Cauchy-Schwarz inequality is used to obtain the third inequality. The proof is complete.
Applying the above results to the equivalent form of EM in (102), we obtain (106), where we have also used σ_max(B W^† B^⊤) = 1, which can be verified from (101), and ‖ψ(Bθ)‖ ≤ √m (see (95)). Applying (106) to (104b), and using the preceding inequality, yields (107). Finally, it should be noted that σ_r is the smallest positive singular value of B, and that ‖V_1^⊤ x‖ = ‖B^⊤(BB^⊤)^† B x‖, which can be verified from (101). Applying these results to (107) leads to the desired result.

A Discussion on the Applications of the Existing Convergence Analysis Results
In this section we examine and discuss the application of existing convergence analysis results to the EM method, with an emphasis on our problem. The convergence of EM was first studied by Wu [32] in the statistics field. EM is a special case of the MM method, and there is a rich collection of literature on the convergence of MM and related first-order methods in the optimization field; see, e.g., [33]-[35] for studies closer to the context of EM. These MM studies often consider a general framework that covers a variety of optimization schemes and a rich class of problems. In that regard, we should point out that we focus on a specific problem arising from MIMO detection, or regression from quantized data, and exploit opportunities unique to our problem.
Let us be specific. The convergence analyses by Wu [32] and Razaviyayn et al. [33] show that the EM or MM method can converge to a stationary point. They consider the non-convex case with fairly general assumptions, but the convergence rate is not considered. The MM convergence rate in the convex case was studied by Mairal [34] and Hong et al. [35]. If we apply the analysis framework of Hong et al. to the EM method (41) for our problem (33), we obtain the result in (114) for some constant C ≥ 0, where κ(B) = σ_max(B)/σ_min(B) is the condition number of B. Also, applying Mairal's framework leads to (115) for some constant R > 0. We give the details of (114)-(115) shortly. The result in (114) is weak compared with our EM result in (42); in fact, it is even weaker than the PG result in (39). Also, as discussed in the main manuscript, the result in (115) is arguably not as good as our EM result in (42).

Proof of (114)
Consider problem (33) with the accompanying problem setup, and consider the EM method (41). Under the assumptions stated above, the framework by Hong et al. [35] gives the result in (116), where C_1 = max{4μ − 2, F(θ^0) − F(θ^⋆), 2}. The question is what the values of γ and β are for our problem. It is seen from (46) that γ = σ_min(B)^2. Also, it can be shown from (76) and Proposition 1 that β = 2σ_max(B)^2. Applying these values to (116) leads to (114).

Proof of (115)
Again, consider problem (33) with the accompanying problem setup, and consider the EM method (41). Under the assumptions stated above, Mairal's framework [34] gives the result in (115). To identify the value of L_r for our problem, first note that f is convex and has a σ_max(B)^2-Lipschitz continuous gradient (cf. Proposition 1 for the latter). From the expression of g in (46), one can easily verify that g(θ|θ′) is convex in θ and has a σ_max(B)^2-Lipschitz continuous gradient with respect to θ, for any given θ′. By Lemma B.9 in the supplementary material of [34], r(θ|θ′) then has a σ_max(B)^2-Lipschitz continuous gradient with respect to θ. We therefore have (115).

Additional Numerical Results
In this section we provide additional numerical results to illustrate the behaviors of the PG and EM methods.

Convergence Behaviors
We want to see whether the PG, EM and AEM methods behave as predicted by the theoretical convergence results in Propositions 1-4. We first consider the theoretical upper bound results in Propositions 1, 2 and 4. To do this, we randomly generated an instance (y, A, θ) in accordance with the basic QLR model in (1), with (m, n) = (16, 4) and with the elements of θ and A being i.i.d. N(0, 1). Then we ran the PG, EM and AEM methods on problem (33) with the penalty function h(θ) = ‖θ‖^2/2 and with a randomly generated initialization. Fig. 6 shows the objective value gaps F(θ^k) − F(θ^⋆) of the three methods and their respective theoretical upper bounds in Propositions 1, 2 and 4; note that this is a one-trial result. We see that the PG, EM and AEM methods converge at rates at least no worse than the respective theoretical upper bounds, which supports our theory. We also notice that the objective value gaps of the three methods are far better than the respective theoretical upper bounds, suggesting that the three methods may converge faster than the theory predicts.
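To illustrate this experimental setup, the numpy sketch below generates a QLR instance with (m, n) = (16, 4) as described and runs PG with h(θ) = ‖θ‖^2/2. Since the exact likelihood in (33) is not reproduced here, we assume a probit-type objective F(θ) = −Σ_i log Φ(y_i a_i^⊤θ/σ) + ‖θ‖^2/2, and the step size mirrors the 1/σ_max(·)^2 choice of Proposition 1; both are assumptions patterned on the surrounding text rather than the paper's exact formulas.

```python
import numpy as np
from math import erf

def Phi(x):
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(x) / np.sqrt(2.0)))

def phi(x):
    return np.exp(-np.asarray(x) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def pg_one_bit_regression(A, y, sigma=1.0, iters=500):
    """PG sketch for an assumed probit-type QLR objective
        F(theta) = -sum_i log Phi(y_i a_i^T theta / sigma) + ||theta||^2/2.
    Step size sigma^2 / sigma_max(A)^2 follows the 1/L convention;
    the closed-form prox of eta*||.||^2/2 is the shrinkage z/(1+eta)."""
    theta = np.zeros(A.shape[1])
    eta = sigma ** 2 / np.linalg.norm(A, 2) ** 2
    for _ in range(iters):
        z = y * (A @ theta) / sigma
        # gradient of the negative log-likelihood (pdf/cdf ratio form)
        grad = -(A.T @ (y * phi(z) / np.maximum(Phi(z), 1e-12))) / sigma
        theta = (theta - eta * grad) / (1.0 + eta)   # PG step + ridge prox
    return theta

# Generate one instance as in the experiment description.
rng = np.random.default_rng(0)
m, n = 16, 4
A = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
y = np.sign(A @ theta_true + rng.standard_normal(m))   # one-bit observations
theta_hat = pg_one_bit_regression(A, y)
```

With step size 1/L the iteration is monotonically non-increasing in F, so the final objective value should be below that of the all-zero initialization.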
Next we consider the theoretical lower bound results in Proposition 3. We use the same setup as above. For each method, we evaluate the smallest number of iterations required to achieve ‖θ^k − θ^⋆‖ ≤ ε, with ε = 10^{−2}. Fig. 7 shows the iteration numbers of the PG and EM methods and their respective theoretical lower bounds in Proposition 3; again this is a one-trial result. We see that the iteration numbers of the PG and EM methods increase with the SNR, and the rates of increase appear consistent with those of the theoretical lower bounds.

Impact of Noise Power Loading
We examine the impact of the noise power loading in (70). The simulation settings are identical to those in Section 8. For brevity we consider GMAP-EM only; the other algorithms exhibit similar behavior in our numerical study. Fig. 8 shows the BER performance for various values of the loading parameter σ_0. We see a performance loss when loading is applied, and the loss increases with σ_0. Table 3 shows the average number of iterations for various values of σ_0 and the SNR. We see that, without noise power loading, convergence is particularly slow at high SNRs, and that this undesirable effect is alleviated when loading is applied.

Figure 6: Objective value gaps of PG, EM and AEM.

Figure 7: Iteration numbers of PG and EM.