An Explanation of Deep MIMO Detection From a Perspective of Homotopy Optimization

Since the work on the detection network (DetNet) by Samuel, Diskin and Wiesel in 2017, deep unfolding for MIMO detection has become a popular topic. We have witnessed significant growth of this topic, wherein various forms of deep unfolding were attempted in an empirical way. DetNet draws insight from the proximal gradient method in the design of its network structure. In this paper, we endeavor to give an explanation of DetNet, in a fundamental way, by drawing a connection to a homotopy optimization approach. The intuitive idea of homotopy optimization is to gradually change the optimization landscape, from an easy convex problem to the difficult MIMO detection problem, such that we may follow the solution path to find the optimal MIMO detection solution. We illustrate that DetNet can be interpreted as a homotopy method realized by the proximal gradient method. We also illustrate how this interpretation extends to Frank-Wolfe and ADMM realizations of the homotopy optimization approach, which result in new DetNet structures. Numerical results are provided to give insights into how these homotopy-inspired DetNets and their respective non-deep homotopy methods perform.


I. INTRODUCTION
Lately there has been much enthusiasm for using deep learning to perform MIMO detection. In SPAWC 2017, Samuel, Diskin and Wiesel made the first attempt to design a structured detection network (DetNet) for MIMO detection by means of deep unfolding [1]. Unlike the standard black-box deep learning approach, deep unfolding is an approach that builds structured deep neural networks based on other iterative algorithms [2]. DetNet mimicked a proximal gradient algorithm for the maximum-likelihood (ML) MIMO detection problem, and the authors demonstrated competitive detection performance and reasonable computational time with DetNet. The success of DetNet triggered many efforts to try unfolding different MIMO detection algorithms, ranging from optimization methods [3], [4], [5], [6], [7] to statistical inference methods [8], [9].
As an appealing feature, deep unfolding-based MIMO detection provides good intuition for explaining the operational mechanisms of deep MIMO detectors. In building such deep MIMO detectors, it is common to introduce some activation function that we typically see in deep learning. Clipping and sigmoid functions are the most widely used, as they can force the network output to be close to the desired constellation points [1], [4], [6], [7]. With the use of these activation functions, deep MIMO detectors become different from their predecessor algorithms. An intriguing question arises: is there any theoretical basis for deep MIMO detectors to use these activation functions?
In this paper we attempt to provide an explanation of the use of the activation functions in DetNet. We do so by drawing a connection between DetNet and homotopy optimization. Homotopy optimization is a general principle for non-convex optimization that has a variety of applications [10], [11], [12], [13], [14]. The idea is to gradually change the optimization landscape, from an easy problem to the difficult target problem, such that we may track the solution path to reach the optimal solution of the target problem. In our previous study, we developed a homotopy formulation for ML MIMO detection with binary constellation points [7]. The present study is built upon that homotopy formulation. We show how a specific proximal gradient-based implementation of the homotopy method leads to an algorithm structure that is similar to DetNet. In particular, the choice of different penalty functions for the homotopy formulation gives rise to different activation functions, including the clipping and sigmoid functions. We also show a similar connection with the Frank-Wolfe and ADMM methods, suggesting that there are other structures for DetNets. Our homotopy interpretation of DetNet can be extended to the higher-order constellation case, as we will show. While our interest lies in giving explanations and drawing connections, we will also provide numerical results to demonstrate the performance of the new homotopy-inspired DetNets arising from our study, as well as the performance of their original non-deep counterparts.
We should mention that, in the broad scope of signal processing and machine learning, researchers have recently taken interest in better connecting deep methods with model-based or theory-driven concepts [15], [16], [17], [18]. While the majority of these studies consider compressive sensing or inverse problems in imaging (see the aforementioned references), which are different from MIMO detection, it is worth saying that our study has a similar flavor in terms of working toward a better understanding of model-based and structure-exploiting deep methods for specific problems.
Some notations are as follows. The notations $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ denote the Euclidean norm and inner product, respectively. We denote $I_{\mathcal{X}}$ as the indicator function of $\mathcal{X}$, i.e., $I_{\mathcal{X}}(x) = 0$ if $x \in \mathcal{X}$ and $I_{\mathcal{X}}(x) = \infty$ otherwise.

A. MIMO DETECTION PROBLEM
Our main interest focuses on the following problem
$$\min_{x \in \{-1,1\}^n} f(x), \qquad (1)$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is convex and has $L_f$-Lipschitz continuous gradient. Our study is motivated by MIMO detection. In MIMO detection, we have a signal model $y = Hx + v$, where $y \in \mathbb{R}^m$ is a received signal, $x \in \{\pm 1\}^n$ is a symbol vector, $H \in \mathbb{R}^{m \times n}$ is the channel matrix, and $v$ is noise. Our goal is to detect $x$ from $y$, with $H$ being known. A typical MIMO detection formulation is the maximum-likelihood (ML) formulation, which takes the form of (1) with
$$f(x) = \tfrac{1}{2} \| y - Hx \|^2.$$
The above $f$ has $\sigma_{\max}(H)^2$-Lipschitz continuous gradient, where $\sigma_{\max}(H)$ denotes the largest singular value of $H$. The problem setup in (1) also applies to one-bit MIMO detection, in which the signal model is $y = \mathrm{sgn}(Hx + v)$ and the ML objective function is
$$f(x) = -\sum_{i=1}^{m} \log \Phi\big( y_i h_i^\top x / \sigma_v \big),$$
where $\Phi$ is the standard Gaussian cumulative distribution function, $h_i^\top$ is the $i$th row of $H$, and $\sigma_v^2$ is the noise power; see, e.g., [19]. The above $f$ has $\sigma_{\max}(H)^2 / \sigma_v^2$-Lipschitz continuous gradient [20].
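For concreteness, a minimal numpy sketch of the ML objective and its gradient (assuming the least-squares form above; the function names are ours, not from the paper):

```python
import numpy as np

def f_ml(x, H, y):
    # ML objective f(x) = ||y - Hx||^2 / 2
    r = y - H @ x
    return 0.5 * r @ r

def grad_f_ml(x, H, y):
    # gradient of f, namely H^T (Hx - y); it is Lipschitz continuous
    # with constant sigma_max(H)^2
    return H.T @ (H @ x - y)
```

These two routines are reused by all the iterative schemes discussed later.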

B. DEEP MIMO DETECTION
Lately there has been much interest in MIMO detection using deep unfolding, which builds deep networks for MIMO detection by borrowing insights from other algorithms. Particularly, in DetNet [1], the first deep unfolding endeavor for MIMO detection, the authors consider a network structure that resembles a gradient descent algorithm
$$x^{k+1} = \rho_{\alpha_k}\big( x^k - \eta_k \nabla f(x^k) \big), \qquad (2)$$
where $\rho_\alpha$ is a nonlinear activation function with parameter $\alpha$, and $\eta_k$ is a step size. In the original DetNet work, $\rho_\alpha$ is chosen as an elementwise clipping function; specifically, $\rho_\alpha(z) = (\rho_\alpha(z_1), \ldots, \rho_\alpha(z_n))$, $\rho_\alpha(z) = \rho(\alpha z)$, and
$$\rho(z) = \begin{cases} z, & |z| \le 1, \\ \mathrm{sgn}(z), & \text{otherwise}. \end{cases}$$
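As a concrete illustration, one layer of the basic structure (2) can be sketched in numpy as follows, assuming the least-squares objective $f(x) = \|y - Hx\|^2/2$ and the clipping activation (the function names are ours):

```python
import numpy as np

def clip_activation(z, alpha):
    # elementwise clipping: rho_alpha(z) = rho(alpha * z), with
    # rho(t) = t for |t| <= 1 and sgn(t) otherwise
    return np.clip(alpha * z, -1.0, 1.0)

def detnet_layer(x, H, y, eta, alpha):
    # one layer of (2): a gradient step on f(x) = ||y - Hx||^2 / 2,
    # followed by the clipping activation
    grad = H.T @ (H @ x - y)
    return clip_activation(x - eta * grad, alpha)
```

Stacking $K$ such layers, each with its own learned $(\eta_k, \alpha_k)$, gives the basic DetNet-style network.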
It should be noted that the original DetNet in [1] considers a network structure that is more general than (2), but the insight largely comes from the basic form in (2). We shall omit those details for the sake of simplicity.
A basic question is how we can give an explanation of (2). In particular, is there any basis that we can provide for (2) from a theoretical viewpoint? A way to explain (2) is to consider the non-convex proximal gradient (PG) method for (1). To describe this, consider the following problem
$$\min_{x \in \mathbb{R}^n} f(x) + h(x), \qquad (3)$$
where $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ may be non-convex. The non-convex PG method for (3) is given by
$$x^{k+1} = \mathrm{prox}_{\eta_k h}\big( x^k - \eta_k \nabla f(x^k) \big), \qquad (4)$$
where $\eta_k > 0$ is the step size, and
$$\mathrm{prox}_h(z) = \arg\min_{x \in \mathbb{R}^n} \tfrac{1}{2} \| x - z \|^2 + h(x)$$
denotes the proximal operator of $h$. The main problem (1) can be rewritten as problem (3) with $h = I_{\{-1,1\}^n}$. The corresponding PG method is given by
$$x^{k+1} = \mathrm{sgn}\big( x^k - \eta_k \nabla f(x^k) \big).$$
We see that the above PG method resembles the DetNet structure (2). Particularly, the DetNet structure looks like a soft sign version of the PG method. In fact, the above DetNet explanation is well known in the literature. The next question is whether this explanation can be used to say that DetNet would usually converge to a good local minimum. It is known that, under some technical assumptions, the PG method can guarantee some form of convergence to a critical point of problem (3) [21]. But this is not very useful for providing an explanation for convergence to good local minima, as any point in $\{-1,1\}^n$ is a critical point of problem (3) under the above choice of $h$.
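The non-convex PG method with $h = I_{\{-1,1\}^n}$ can be sketched as follows; the prox of the indicator of $\{-1,1\}^n$ is the elementwise sign (a sketch with our own names, assuming the least-squares $f$ and breaking ties toward $+1$):

```python
import numpy as np

def pg_ml_detect(H, y, x0, eta, num_iter=50):
    # non-convex PG for min f(x) s.t. x in {-1,1}^n,
    # with f(x) = ||y - Hx||^2 / 2
    x = x0.copy()
    for _ in range(num_iter):
        grad = H.T @ (H @ x - y)       # gradient step on f
        z = x - eta * grad
        x = np.where(z >= 0, 1.0, -1.0)  # prox step = hard sign
    return x
```

Note that the iterate is always a point of $\{-1,1\}^n$, which illustrates why every feasible point is a critical point under this choice of $h$.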

III. AN EXPLANATION BY HOMOTOPY OPTIMIZATION
To attempt to provide a better explanation for DetNet, we consider the following problem
$$\min_{x \in \mathbb{R}^n} f(x) + \lambda\, h(x), \qquad h(x) = I_{[-1,1]^n}(x) - \tfrac{1}{2} \| x \|^2, \qquad (5)$$
where $\lambda > 0$. As illustrated in Fig. 2, $h$ serves as a penalty that discourages $x$ from falling outside $\{-1,1\}^n$: the indicator confines $x$ to the box $[-1,1]^n$, while the concave term $-\|x\|^2/2$ pulls $x$ toward the vertices $\{-1,1\}^n$. It can be shown that, for a sufficiently large $\lambda$, problem (5) is equivalent to the main problem (1) in the following sense.

Fact 1 ([22]):
If $\lambda > L_f$, then any optimal solution to (5) is also an optimal solution to (1).

The upshot with the formulation in (5) is that it is a continuous optimization problem. However, (5) is still non-convex and can suffer from convergence to local minima. In [7] we employed a homotopy strategy to try to deal with this issue. The idea is to gradually increase $\lambda$ as we solve (5). For example, by applying the previously reviewed non-convex PG method to (5), we may do
$$x^{k+1} = \rho_1\big( x^k - \eta_k \nabla g_{\lambda_k}(x^k) \big), \qquad (6)$$
where $g_\lambda(x) = f(x) - \lambda \| x \|^2 / 2$, $\rho_1$ is the elementwise clipping function with $\alpha = 1$ (cf. the previous section), i.e., the projection onto $[-1,1]^n$, and $\{\lambda_k\}$ is a sequence with a gradually increasing trend. The homotopy strategy is based on two arguments: (i) (5) should be easier to solve for smaller $\lambda$, as (5) is convex when $\lambda = 0$; (ii) a small change of $\lambda$ should lead to mild changes in the optimization landscape. Hence, by gradually increasing $\lambda$, we hope that the algorithm will trace the solution path of (5) (with respect to $\lambda$) and lead us to the optimal solution of (1). The reader is referred to [7] for more insights into the homotopy strategy. In the present work, we study how to draw a connection between the homotopy strategy and the DetNet structure (2).
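The homotopy PG strategy in (6) can be sketched as follows (a minimal numpy sketch with our own names, assuming the least-squares $f$ and a fixed number of PG iterations per value of $\lambda$):

```python
import numpy as np

def homotopy_pg(H, y, lambdas, eta, iters_per_stage=20):
    # homotopy PG as in (6): run PG on (5) while gradually increasing
    # lambda; here g_lambda(x) = f(x) - lambda*||x||^2/2 with
    # f(x) = ||y - Hx||^2/2, and the prox step is the projection
    # (clipping) onto [-1, 1]^n
    x = np.zeros(H.shape[1])
    for lam in lambdas:                        # increasing sequence
        for _ in range(iters_per_stage):
            grad = H.T @ (H @ x - y) - lam * x  # gradient of g_lambda
            x = np.clip(x - eta * grad, -1.0, 1.0)
    return x
```

For small $\lambda$ the stage problem is convex and easy; as $\lambda$ grows, the iterate is pushed toward the vertices $\{-1,1\}^n$.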

A. A HOMOTOPY METHOD AND DETNET
As a variation of the above homotopy strategy, consider the following. Let $\bar{L}$ be any constant such that $\bar{L} > L_f$. Consider
$$\min_{x \in \mathbb{R}^n} f(x) + \bar{L}\, h_\gamma(x), \qquad (7)$$
where
$$h_\gamma(x) = I_{[-1,1]^n}(x) - \tfrac{1}{2} \| x \|^2 + \gamma\, \phi(x), \qquad (8)$$
$\gamma > 0$, and $\phi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is some function that is $\beta$-strongly convex on $[-1,1]^n$. If we set $\gamma = 0$, (7) is the same as (5) with $\lambda = \bar{L}$; and the latter is equivalent to the main problem (1). Also, if $\gamma \ge 1/\beta$, $h_\gamma$ is convex. This leads us to a variation of the previous homotopy method: start with a large $\gamma$, and gradually reduce $\gamma$ as we solve (7). Let us consider the non-convex PG method for (7):
$$x^{k+1} = \mathrm{prox}_{\eta_k \bar{L} h_\gamma}\big( x^k - \eta_k \nabla f(x^k) \big). \qquad (9)$$
Here we consider a fixed $\gamma$; later we will let $\gamma$ vary as in (6). The PG method in (9) can lead to convergence to a critical point of problem (7) if the step size satisfies $0 < \eta_k < 1/L_f$ [21]. Let us choose $\eta_k = 1/\bar{L}$. Let
$$\mathrm{LO}_{\psi}(z) = \arg\min_{x \in [-1,1]^n} \psi(x) - \langle x, z \rangle$$
denote the linear optimization (LO) oracle associated with a function $\psi$. From the definition of the proximal operator, we have
$$\mathrm{prox}_{h_\gamma}(z) = \arg\min_{x \in [-1,1]^n} \tfrac{1}{2} \| x - z \|^2 - \tfrac{1}{2} \| x \|^2 + \gamma \phi(x) = \mathrm{LO}_{\gamma \phi}(z), \qquad (10)$$
where the second equality holds because $\tfrac{1}{2}\|x - z\|^2 - \tfrac{1}{2}\|x\|^2 = -\langle x, z \rangle + \tfrac{1}{2}\|z\|^2$ is linear in $x$ up to a constant. By putting (10) and $\eta_k = 1/\bar{L}$ back into (9), and by allowing $\gamma$ to vary so as to perform homotopy optimization, we are led to the following PG-based homotopy method
$$x^{k+1} = \mathrm{LO}_{\gamma_k \phi}\big( x^k - \tfrac{1}{\bar{L}} \nabla f(x^k) \big). \qquad (11)$$
Intriguingly, the above method resembles the DetNet structure in (2). In particular, the LO oracle serves essentially the same role as the DetNet activation function $\rho_\alpha$. Even more interestingly, we know what form $\phi$ takes if we want the LO oracle $\mathrm{LO}_{\gamma_k \phi}$ to be a clipping or sigmoid function.

Fact 2:
We have the following results.
(a) Let $\phi(x) = \| x \|^2 / 2$. The function $\phi$ is 1-strongly convex, and it holds that
$$\mathrm{LO}_{\gamma \phi}(z) = \rho_{1/\gamma}(z),$$
where $\rho_\alpha$ is the elementwise clipping function in the previous section; i.e., the LO oracle is the clipping activation with $\alpha = 1/\gamma$.
(b) Let
$$\phi(x) = \sum_{i=1}^n (1 + x_i) \log(1 + x_i) + (1 - x_i) \log(1 - x_i). \qquad (12)$$
The function $\phi$ is 2-strongly convex on $[-1,1]^n$, and it holds that
$$\mathrm{LO}_{\gamma \phi}(z) = \rho_{1/\gamma}(z), \qquad \rho(z) = \tanh(z/2) = 2\sigma(z) - 1,$$
applied elementwise, where $\sigma(z) = 1/(1 + e^{-z})$; i.e., the LO oracle is a sigmoid-type activation with $\alpha = 1/\gamma$.

The proof of Fact 2.(a) is trivial. The LO oracle result in Fact 2.(b) was shown in [23]. The strong convexity result in Fact 2.(b) can be shown by checking the Hessian of $\phi$; specifically, it can be verified that (12) has $\nabla^2 \phi(x) = \mathrm{Diag}\big( 2/(1 - x_1^2), \ldots, 2/(1 - x_n^2) \big)$, whose diagonal elements are no less than 2 for $x \in (-1,1)^n$. To give more insights, Fig. 3 plots $h_\gamma$ for the above two cases of $\phi$.
To summarize, the DetNet structure (2) may be explained as an outcome of the PG-based homotopy method (11). The choice of the strongly convex function $\phi$ in the homotopy method determines the form the DetNet activation function takes. The homotopy parameter $\gamma$ and the activation function parameter $\alpha$ have a direct correspondence, specifically $\gamma = \alpha^{-1}$. This carries an implication: if we choose a smaller $\gamma$, so that problem (7) approximates the main problem (1) better, and we choose one of the $\phi$'s in Fact 2, then the corresponding activation function $\rho_\alpha$ becomes closer to the sign function, which makes sense intuitively.
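Fact 2 can be checked numerically. The sketch below (our own names; the entropy-like $\phi$ is the form consistent with the Hessian $\nabla^2 \phi(x) = 2/(1-x^2)$) evaluates the scalar LO oracle by brute force on a grid and compares it against the stated closed forms: clipping of $z/\gamma$ for the quadratic $\phi$, and the sigmoid-type map $z \mapsto \tanh(z/(2\gamma))$ for the entropy-like $\phi$:

```python
import numpy as np

def lo_oracle_1d(z, gamma, phi):
    # brute-force evaluation of the scalar LO oracle
    # LO_{gamma*phi}(z) = argmin_{x in [-1,1]} gamma*phi(x) - z*x
    grid = np.linspace(-1.0 + 1e-9, 1.0 - 1e-9, 200001)
    return grid[np.argmin(gamma * phi(grid) - z * grid)]

# phi(x) = x^2/2: the oracle should be clipping with alpha = 1/gamma
phi_quad = lambda x: 0.5 * x ** 2
# entropy-like phi with phi''(x) = 2/(1 - x^2): the oracle should be
# the sigmoid-type map z -> tanh(z / (2*gamma))
phi_ent = lambda x: (1 + x) * np.log1p(x) + (1 - x) * np.log1p(-x)
```

The brute-force argmin agrees with the closed forms to within the grid resolution.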

B. FRANK-WOLFE DETNET
Previously we implemented the homotopy strategy by the non-convex PG method. We can consider other implementation alternatives, and we want to see what new DetNet structures arise from them. In this subsection we consider the non-convex Frank-Wolfe (FW) method [24], [25]. To describe it, reformulate (7) as
$$\min_{x \in [-1,1]^n} g(x) + \bar{L} \gamma\, \phi(x), \qquad g(x) = f(x) - \tfrac{\bar{L}}{2} \| x \|^2. \qquad (13)$$
The FW method for (13) is given by
$$s^k = \arg\min_{s \in [-1,1]^n} \langle \nabla g(x^k), s \rangle + \bar{L} \gamma\, \phi(s), \qquad x^{k+1} = x^k + \xi_k (s^k - x^k),$$
where $\xi_k \in [0,1]$ is a step size. Under an appropriate step-size rule, such as a diminishing step size, the FW method can lead to convergence to a critical point of problem (13) [24], [25]. It can be verified, in the same way as in (10), that
$$s^k = \mathrm{LO}_{\gamma \phi}\big( x^k - \tfrac{1}{\bar{L}} \nabla f(x^k) \big) = \rho_\alpha\big( x^k - \tfrac{1}{\bar{L}} \nabla f(x^k) \big)$$
for some activation function $\rho_\alpha$ (cf. Fact 2). By letting $\gamma$ vary over the iterations, we are led to a FW variant of the DetNet structure
$$x^{k+1} = (1 - \xi_k)\, x^k + \xi_k\, \rho_{\alpha_k}\big( x^k - \eta_k \nabla f(x^k) \big),$$
where $\alpha_k, \eta_k, \xi_k$ are the parameters to be learnt.
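One layer of the FW-style structure above can be sketched as follows (our own names; assuming the least-squares $f$ and the clipping activation):

```python
import numpy as np

def fw_detnet_layer(x, H, y, eta, alpha, xi):
    # one FW-style layer: an activation-mapped gradient step on
    # f(x) = ||y - Hx||^2 / 2, convexly combined with the iterate
    grad = H.T @ (H @ x - y)
    s = np.clip(alpha * (x - eta * grad), -1.0, 1.0)  # LO/activation step
    return x + xi * (s - x)                            # xi in [0, 1]
```

Compared with the PG layer, the FW layer keeps a convex combination of the old iterate and the activation output, so the iterate need not jump all the way to the activation value.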

C. ADMM DETNET
We consider the non-convex alternating direction method of multipliers (ADMM) [26], [27] as another homotopy implementation alternative. Rewrite (7) as
$$\min_{x, z \in \mathbb{R}^n} f(x) + \bar{L}\, h_\gamma(z) \quad \text{s.t.} \quad z = x. \qquad (14)$$
The associated augmented Lagrangian function is given by
$$\mathcal{L}(x, z, u) = f(x) + \bar{L}\, h_\gamma(z) + \langle u, x - z \rangle + \tfrac{1}{2} \| x - z \|_{\mathrm{Diag}(\tau)}^2,$$
where $u$ is the dual variable associated with the constraint $z = x$, $\tau = (\tau_1, \ldots, \tau_n) > 0$ is a given parameter vector, $\mathrm{Diag}(\tau)$ denotes a diagonal matrix with diagonal elements $\tau$, and $\| w \|_{\mathrm{Diag}(\tau)}^2 = \langle w, \mathrm{Diag}(\tau)\, w \rangle$. The non-convex ADMM method for (14) is given by
$$x^{k+1} = \arg\min_{x} \mathcal{L}(x, z^k, u^k), \qquad z^{k+1} = \arg\min_{z} \mathcal{L}(x^{k+1}, z, u^k), \qquad u^{k+1} = u^k + \tau \odot (x^{k+1} - z^{k+1}),$$
where $\odot$ is the elementwise product. Here, the function $\phi$ associated with $h_\gamma$ is assumed to take one of the forms in Fact 2. In the $z$-update step, if we choose $\tau_i = \bar{L}$ for all $i$, then, by the same spirit of the preceding study, we have
$$z^{k+1} = \mathrm{LO}_{\gamma \phi}\big( x^{k+1} + u^k / \bar{L} \big) = \rho_\alpha\big( x^{k+1} + u^k / \bar{L} \big).$$

This leads to an ADMM DetNet variant
$$\begin{aligned} x^{k+1} &= x^k - \eta_k \big( \nabla f(x^k) + u^k + \tau^k \odot (x^k - z^k) \big), \\ z^{k+1} &= \rho_{\alpha_k}\big( x^{k+1} + u^k / \bar{L} \big), \\ u^{k+1} &= u^k + \tau^k \odot (x^{k+1} - z^{k+1}), \end{aligned} \qquad (15)$$
where $\tau^k, \alpha_k, \eta_k$ are the parameters to learn. Equation (15) resembles the ADMM DetNet in [5]. It is interesting to note that the choice $\tau_i = \bar{L}$ for all $i$ coincides with one of the sufficient conditions for the non-convex ADMM method to converge to a stationary point of problem (14); see [26], [27].
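The ADMM iteration can be sketched as follows (our own names; a simplified sketch with a scalar $\tau$, an exact least-squares $x$-update for $f(x) = \|y - Hx\|^2/2$, and the clipping activation for the $z$-update):

```python
import numpy as np

def admm_detect(H, y, tau, alpha, num_iter=50):
    # sketch of the non-convex ADMM variant for
    # min f(x) + L*h_gamma(z) s.t. z = x, with f(x) = ||y - Hx||^2/2;
    # the x-update is a regularized least squares, the z-update
    # reduces to an activation, and u is the dual variable
    n = H.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    A = H.T @ H + tau * np.eye(n)
    for _ in range(num_iter):
        x = np.linalg.solve(A, H.T @ y + tau * (z - u))  # x-update
        z = np.clip(alpha * (x + u), -1.0, 1.0)          # z-update
        u = u + x - z                                    # dual update
    return z
```

Here $u$ is kept in scaled form, so the $z$-update reads $z = \rho_\alpha(x + u)$.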

D. BEYOND THE {−1, 1} CASE
We can generalize the result to the multilevel case
$$\min_{x \in \{\pm 1, \pm 3, \ldots, \pm(2D-1)\}^n} f(x), \qquad (16)$$
where $D$ is a positive integer. Consider the following problem
$$\min_{x \in \mathbb{R}^n} f(x) + \lambda\, H(x), \qquad (17)$$
where
$$H(x) = \min_{c \in \mathcal{C}^n} h(x - c), \qquad \mathcal{C} = \{0, \pm 2, \ldots, \pm(2D-2)\}. \qquad (18)$$
As illustrated in Fig. 4, $H$ is a multilevel extension of the penalty $h$ in (5). We can show that, for a large enough $\lambda$, (17) is an equivalent formulation of (16).

Proposition 1:
If λ > L f , then any optimal solution to (17) is also an optimal solution to (16).
Proof of Proposition 1: Problem (17) can be rewritten as
$$\min_{c \in \mathcal{C}^n} \min_{x' \in \mathbb{R}^n} f(x' + c) + \lambda\, h(x'), \qquad (19)$$
where we perform a change of variable $x' = x - c$. Consider the inner minimization problem in (19). Since $f$ has $L_f$-Lipschitz continuous gradient, $f(\cdot + c)$ also has $L_f$-Lipschitz continuous gradient. By Fact 1, if $\lambda > L_f$, then any optimal solution to the inner minimization problem in (19) is an optimal solution to $\min_{x' \in \{-1,1\}^n} f(x' + c)$. Putting this result back into (19), (17) is seen to be equivalent to (16). The proof is complete.

Next we examine the application of the homotopy strategy in Section III-A to (17). Consider
$$\min_{x \in \mathbb{R}^n} f(x) + \bar{L}\, H_\gamma(x), \qquad H_\gamma(x) = \min_{c \in \mathcal{C}^n} h_\gamma(x - c), \qquad (20)$$
where $\bar{L}$ and $h_\gamma$ follow the definitions in (8). We should note the caveat that $H_\gamma$ is non-convex in general; see Fig. 5. This is unlike its binary counterpart $h_\gamma$. Still, we want to see what happens. Consider the following result.
Proposition 2: Suppose $\phi$ takes one of the forms in Fact 2, with the associated activation $\rho_\alpha$, $\alpha = 1/\gamma$. Then, in the same sense as (10),
$$\mathrm{prox}_{H_\gamma}(z) = P_\alpha(z) := \big( P_\alpha(z_1), \ldots, P_\alpha(z_n) \big), \qquad P_\alpha(z_i) = c_i + \rho_\alpha(z_i - c_i), \qquad (21)$$
where $c_i \in \mathcal{C}$ attains the minimum in the associated proximal problem (see the Appendix).

The proof of Proposition 2 is relegated to the Appendix. By following the same non-convex PG development as in Section III-A, the PG-based homotopy method is given by
$$x^{k+1} = P_{\alpha_k}\big( x^k - \tfrac{1}{\bar{L}} \nabla f(x^k) \big). \qquad (22)$$
We see that (22) resembles the DetNet structure (2). Fig. 6 illustrates $P_\alpha$ for the sigmoid case. In the figure we also illustrate the typically used multilevel sigmoid function
$$\tilde{P}_\alpha(z) = -(2D-1) + 2 \sum_{c \in \mathcal{C}} \sigma\big( \alpha (z - c) \big), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}.$$
We see that $\tilde{P}_\alpha$ closely approximates $P_\alpha$ when $\alpha$ is large. Hence, by approximating $P_\alpha$ with $\tilde{P}_\alpha$ in (22), we are led to a multilevel version of the DetNet in (2). Following this same line of development, we also have a FW DetNet structure
$$x^{k+1} = (1 - \xi_k)\, x^k + \xi_k\, \tilde{P}_{\alpha_k}\big( x^k - \eta_k \nabla f(x^k) \big), \qquad (23)$$
where $\{\xi_k, \alpha_k, \eta_k\}$ are the trainable network parameters.
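One common construction of the multilevel sigmoid $\tilde{P}_\alpha$, consistent with the description above (its exact parameterization is our assumption), is a sum of logistic sigmoids shifted to the even-integer transition points:

```python
import numpy as np

def multilevel_sigmoid(z, alpha, D):
    # multilevel sigmoid for the constellation {+-1, ..., +-(2D-1)}:
    # a sum of shifted logistic sigmoids with transitions at the even
    # integers; for large alpha it approaches the nearest-point map
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    transitions = np.arange(-(2 * D - 2), 2 * D - 1, 2)  # -2D+2, ..., 2D-2
    out = -(2.0 * D - 1.0)
    for c in transitions:
        out = out + 2.0 * sigmoid(alpha * (z - c))
    return out
```

For $D = 1$ this reduces to $2\sigma(\alpha z) - 1 = \tanh(\alpha z / 2)$, i.e., the binary sigmoid activation of Fact 2.(b).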

IV. SIMULATIONS
While providing explanations and drawing connections are the focus of this study, it is still interesting to show some numerical results. We train DetNets according to the DetNet structures shown in the last section. Specifically, for the PG DetNet structure (22), we consider the following variant
$$z^k = x^k + \xi_k (x^k - x^{k-1}), \qquad x^{k+1} = \tilde{P}_{\alpha_k}\big( z^k - \gamma_k \nabla f(z^k) \big), \qquad (24)$$
where $z^k$ is an extrapolated point between two successive layers, $\tilde{P}_\alpha$ is the multilevel sigmoid function, and $\{\alpha_k, \xi_k, \gamma_k\}$ are the trainable network parameters. The above structure takes insight from the accelerated PG method for convex optimization [28], [29]. For the FW DetNet structure, we follow (23). For the ADMM DetNet structure, we follow (15). In the training phase, we randomly generate $\{x, H, y\}$ in each training epoch, and we optimize $\{\alpha_k, \xi_k, \gamma_k\}$ to minimize a loss function given by the sum of the squared errors between the layer outputs and the true symbol vector $x$. The training phase is implemented with PyTorch 1.12 on Python 3.8. We use the Adam stochastic gradient optimizer with learning rate 0.001. The batch size in each training iteration is 200. We set $K = 20$ layers. We are also interested in the homotopy methods, i.e., the non-deep counterparts of the DetNets. We implement the homotopy PG (HoT-PG) method in (22). We initialize HoT-PG by the solution of the box relaxation (i.e., the problem with $h(x) = I_{[-(2D-1),\, 2D-1]^n}(x)$), or equivalently, $\gamma_0 = 2$. We start the algorithm with $\gamma_1 = 3$, and gradually reduce $\gamma_k$ by $\gamma_{k+1} = \gamma_k / \beta_k$ for some $\beta_k \ge 1$. We set $\beta_{k+1} = 2$ if $\| x^{k+1} - x^k \| \le \epsilon$ for $\epsilon = 10^{-3}$, and we set $\beta_{k+1} = 1$ otherwise. We stop the algorithm when $\gamma_k$ is sufficiently small. We also implement the homotopy FW (HoT-FW) method in (23) and the homotopy ADMM (HoT-ADMM) method in the same way.
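The adaptive schedule for $\gamma_k$ described above can be sketched as follows (a minimal sketch with our own names):

```python
import numpy as np

def gamma_schedule_step(gamma, beta, x_new, x_old, eps=1e-3):
    # one step of the adaptive homotopy schedule: divide gamma by
    # beta, and set the next beta to 2 when the iterates have nearly
    # stopped moving (stalled stage), else keep beta = 1
    gamma_next = gamma / beta
    beta_next = 2.0 if np.linalg.norm(x_new - x_old) <= eps else 1.0
    return gamma_next, beta_next
```

The effect is that $\gamma$ is held (roughly) fixed while the iterates are still moving, and is halved once the current stage problem has been approximately solved.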
We consider the standard MIMO detection problem. The simulation data are first generated by a complex signal model $\tilde{y} = \tilde{H} \tilde{x} + \tilde{v}$, where $\tilde{H}$ is the complex MIMO channel, $\tilde{x}$ is a complex QAM symbol vector, and $\tilde{v}$ is noise. The channel $\tilde{H}$ is randomly generated, following an elementwise independent and identically distributed (i.i.d.) circular Gaussian distribution with zero mean and unit variance. The noise term $\tilde{v}$ is generated as an elementwise i.i.d. circular Gaussian random vector with zero mean. We transform the complex model to the real model $y = Hx + v$ by
$$y = \begin{bmatrix} \Re(\tilde{y}) \\ \Im(\tilde{y}) \end{bmatrix}, \qquad H = \begin{bmatrix} \Re(\tilde{H}) & -\Im(\tilde{H}) \\ \Im(\tilde{H}) & \Re(\tilde{H}) \end{bmatrix}, \qquad x = \begin{bmatrix} \Re(\tilde{x}) \\ \Im(\tilde{x}) \end{bmatrix}.$$
We also consider the DetNet originally developed by Samuel, Diskin and Wiesel [1]. We call this DetNet "OrigDetNet," to avoid confusion with the above introduced PG, FW and ADMM DetNets. OrigDetNet is implemented using the open source code.¹ In addition, we consider several classic MIMO detectors that are not deep learning based. Specifically, for 4-QAM (or the $\{-1,1\}$ case), we consider the linear minimum mean square error (MMSE) detector, semidefinite relaxation (SDR) with fast row-by-row implementation [30], and the box relaxation ($h(x) = I_{[-(2D-1),\, 2D-1]^n}(x)$ in (5)); for 16-QAM (or the $\{\pm 1, \pm 3\}$ case), we consider MMSE, BC-SDR [31], and the box relaxation.
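The complex-to-real transform above is standard and can be sketched as follows (our own function names):

```python
import numpy as np

def complex_to_real(H_c, y_c):
    # standard complex-to-real transform: y = [Re(y); Im(y)],
    # H = [[Re(H), -Im(H)], [Im(H), Re(H)]], so that the complex
    # model y_c = H_c x_c + v_c becomes the real model y = H x + v
    # with x = [Re(x_c); Im(x_c)]
    H = np.block([[H_c.real, -H_c.imag],
                  [H_c.imag,  H_c.real]])
    y = np.concatenate([y_c.real, y_c.imag])
    return H, y
```

Under this transform, a 4-QAM complex symbol vector maps to a real vector with entries in $\{-1, 1\}$, matching the binary formulation (1).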
We show the bit-error rate (BER) performance. The number of trials to obtain the BER results is 10,000. We implemented the simulations in MATLAB and ran them on a laptop with an Intel Core i7-9750H and 16 GB of memory. First, consider the 4-QAM case. Fig. 7 shows the result for a 60 × 40 MIMO system, which is a setting considered in the OrigDetNet work [1]. It is seen that, except for MMSE, all the algorithms show comparable performance, and their BER performances are within 3 dB of the no-interference lower bound. In particular, although the PG, FW and ADMM DetNets are simpler than OrigDetNet and have far fewer network parameters, they show performance as promising as that of OrigDetNet. Also, we see that the PG, FW and ADMM DetNets show enhanced performance over HoT-PG, HoT-FW and HoT-ADMM, their non-deep counterparts.
We increase the difficulty by considering critically determined MIMO systems. The results are shown in Figs. 8 and 9. The OrigDetNet work [1] did not consider these settings. We tried to train OrigDetNet under these settings, but we were not successful in training a high-performance OrigDetNet. Luckily, we found that the PG, FW and ADMM DetNets are easy to train under these settings, and their performances are promising. In addition, the BER curves of HoT-PG, HoT-FW, HoT-ADMM and the box relaxation are several dBs away from the no-interference performance lower bound.
The average runtimes of the considered algorithms are shown in Table 1. It is seen that the PG, FW and ADMM DetNets are computationally competitive; this is because their structures are simple and they use only 20 layers. By comparison, HoT-PG, HoT-FW and HoT-ADMM need more iterations to carry out homotopy optimization, and they consume more runtime. Now we turn our attention to the 16-QAM case. Figs. 10-11 show the BER performance under different problem sizes. The OrigDetNet work did not consider such large MIMO sizes, and we were unsuccessful in obtaining reasonable results with OrigDetNet as of the writing of this paper. Hence we do not include OrigDetNet in this simulation. It is seen that the PG-, FW- and ADMM-based homotopy methods, with and without deep unfolding, work well. Also, with deep unfolding, the PG, FW and ADMM DetNets again show enhanced detection performance. The average runtime performances of these algorithms are summarized in Table 2. Again, we see that the PG, FW and ADMM DetNets are computationally competitive.
We have been showing positive results for the PG, FW and ADMM DetNets. Fig. 12 shows a less satisfactory result, wherein we consider a critically determined MIMO system and 16-QAM. Perhaps more complex network structures and more advanced deep learning techniques may help improve the performance; this is left as future work. Also, HoT-PG, HoT-FW and HoT-ADMM give performance similar to that in the previous cases.
It is interesting to unpack the deep MIMO detectors and see the trend of the learned homotopy parameter {α k }. Fig. 13 shows the learned homotopy parameter sequences {α k } of the PG DetNet under 4-QAM and 16-QAM. Interestingly, the learned sequence exhibits an increasing trend, which agrees with the rationale of homotopy optimization.

V. CONCLUSION
To conclude, we drew a connection between DetNet and homotopy optimization. Using this connection, we argued that DetNet can be explained as a homotopy method. As future work, it would be interesting to study how this connection provides further insights into building better DetNets for more challenging scenarios (such as the one in Fig. 12) and for other types of constellations.

APPENDIX PROOF OF PROPOSITION 2
The problem associated with $\mathrm{prox}_{H_\gamma}$ can be expressed as
$$\min_{z \in \mathbb{R},\, c \in \mathcal{C}} \tfrac{1}{2} | z - x |^2 + h_\gamma(z - c).$$