Codebook Training for Trellis-Based Hierarchical Grassmannian Classification

Abstract—We consider classification of points on a complex-valued Grassmann manifold of m-dimensional subspaces within the n-dimensional complex Euclidean space. We introduce a trellis-based hierarchical classification network, based on an orthogonal product decomposition of the orthogonal basis representing the m-dimensional subspace. Exploiting the similarity of the proposed trellis classifier to a neural network, we propose stochastic gradient-based training techniques. We apply the proposed methods to two important applications in wireless communication, namely Grassmannian channel state information quantization in multiple-input multiple-output communications and non-coherent Grassmannian multi-resolution transmission.


I. INTRODUCTION
COMPLEX-VALUED Grassmann manifolds play a prominent role in wireless communications. Grassmannians have been successfully applied for non-coherent transmissions over block-fading channels [1]–[11], as well as for channel state information (CSI) quantization in limited-feedback-based multiple-input multiple-output (MIMO) communications [12]–[16]. In both of these contexts, points on a Grassmann manifold have to be classified at the receivers, either to detect the transmit symbols or to quantize the CSI. Thus, computationally efficient Grassmannian classification is required to support real-time application of such techniques. In [17], an autoencoder deep neural network has been proposed to tackle this problem.
Real-valued Grassmann manifolds have also found application in image classification problems [18]- [20]. In contrast to wireless communication problems, the dimensions in image classification are commonly much larger, however, computational complexity is not a major concern and therefore very complex classification networks can be implemented.
Contribution: In this letter, we consider a product decomposition of points on a Grassmann manifold as the basis for a hierarchical classifier. We approximate the joint hierarchical classification problem by a trellis-based classifier, generalizing our approach of [10] from one-dimensional to arbitrary m-dimensional subspaces. Our initial approach to this problem in [21] was a greedy recursive classifier, which has the advantage of allowing an analytic performance investigation; however, it entails a significant performance loss compared to joint classification. In [22], we therefore generalized the recursive classifier to a tree-based structure, which allows trading off performance for complexity by performing a pruned tree search. The trellis classifier proposed in the present paper is essentially an approximation of a full tree search, obtained by folding the branches of the tree on top of each other. This approach allows classification on high-dimensional Grassmannians at relatively low complexity. It furthermore supports efficient classifier codebook training employing stochastic gradients and backpropagation, similar to a neural network. We apply the classifier to two important applications in wireless communications, namely Grassmannian CSI quantization and non-coherent Grassmannian transmission.

Notation: The Grassmann manifold of m-dimensional subspaces of the complex-valued n-dimensional Euclidean space is G(n, m), n > m. The conjugate transpose of matrix A is A^H, its Frobenius norm is ‖A‖, and its d-th diagonal is diag(A, d). We use I_m for the m × m identity matrix and 1_m for a length-m all-ones vector. The subspace spanned by the columns of an orthogonal basis U ∈ C^{n×m} is span(U). The operation a_max = arg max_{a∈A} f(a) determines the maximizer a_max of the function f(a) over the set A. The size of a set A is |A|. The expected value of a random variable r is E(r).

II. HIERARCHICAL GRASSMANNIAN CLASSIFICATION

A. System Model
We consider points on a complex-valued Grassmann manifold G(n, m), represented by corresponding orthogonal bases U ∈ C^{n×m}, U^H U = I_m. These points are hierarchically classified within R Grassmann manifolds G(n, d_r) of decreasing subspace dimensions n = d_0 > d_1 > · · · > d_R = m. Hierarchical subspace classification is based on an orthogonal product decomposition of the subspace span(U) as follows:

U = Q^(1) Q^(2) · · · Q^(R). (1)

Matrix Q^(i) ∈ C^{d_{i−1}×d_i} represents a point on G(d_{i−1}, d_i), and the partial product U_r = ∏_{i=1}^r Q^(i), being an orthogonal basis itself, corresponds to the hierarchical product decomposition of layer r with span(U_r) ∈ G(n, d_r).
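As a quick numerical illustration of the product decomposition (1), the following NumPy sketch builds random orthonormal factors Q^(i) of decreasing dimensions and verifies that every partial product is again an orthonormal basis. The dimensions d_0 = 8 > 6 > 4 > d_R = 2 and all variable names are hypothetical, chosen small for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_basis(d_in, d_out, rng):
    """Random orthonormal basis in C^{d_in x d_out} via a reduced QR decomposition."""
    A = rng.standard_normal((d_in, d_out)) + 1j * rng.standard_normal((d_in, d_out))
    return np.linalg.qr(A)[0]

# hypothetical dimension chain d_0 = n > d_1 > d_2 > d_R = m
dims = [8, 6, 4, 2]
Qs = [random_basis(dims[i], dims[i + 1], rng) for i in range(len(dims) - 1)]

# partial products U_r are orthonormal bases themselves, span(U_r) in G(n, d_r)
U = Qs[0]
for Q in Qs[1:]:
    U = U @ Q
    assert np.allclose(U.conj().T @ U, np.eye(U.shape[1]))

print(U.shape)  # final product: a basis of a point on G(8, 2)
```

The orthonormality of each partial product follows directly from (Q^(1) Q^(2))^H Q^(1) Q^(2) = Q^(2)H Q^(2) = I.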

B. Hierarchical Classifier
In hierarchical subspace classification, the exact product decomposition (1) is replaced by the following codebook-based approximation:

Û = Q^(1)_{j_1} Q^(2)_{j_2} · · · Q^(R)_{j_R}, {j_1, …, j_R} = arg max ‖U^H Q^(1)_{j_1} · · · Q^(R)_{j_R}‖². (2)

Here, the orthogonal bases Q^(i)_{j_i} of the R layers are taken from sub-codebooks Q^(i) = {Q^(i)_1, …, Q^(i)_{D_i}} of sizes D_i = 2^{b_i}. An optimal solution of this hierarchical classification problem requires a joint search over the product codebook Q^(1) × · · · × Q^(R), which is practically not feasible for larger sub-codebook sizes.
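To make the cost of the joint search concrete, the following sketch solves the classification by exhaustive enumeration of the product codebook for a toy setup with R = 2 layers and D_i = 4 random codebook entries each; all dimensions and names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def random_basis(d_in, d_out):
    A = rng.standard_normal((d_in, d_out)) + 1j * rng.standard_normal((d_in, d_out))
    return np.linalg.qr(A)[0]

# toy sub-codebooks: R = 2 layers, D_i = 4 entries each (b_i = 2 bits)
codebooks = [[random_basis(6, 4) for _ in range(4)],
             [random_basis(4, 2) for _ in range(4)]]
U = random_basis(6, 2)  # point on G(6, 2) to classify

# exhaustive joint search over the product codebook (D_1 * D_2 = 16 candidates)
best_idx, best_metric = None, -np.inf
for j1, j2 in itertools.product(range(4), range(4)):
    Uhat = codebooks[0][j1] @ codebooks[1][j2]
    metric = np.linalg.norm(U.conj().T @ Uhat) ** 2
    if metric > best_metric:
        best_idx, best_metric = (j1, j2), metric

print(best_idx, round(best_metric, 3))
```

The candidate count grows as the product of the sub-codebook sizes, which is exactly what the trellis approximation of the next subsection avoids.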

C. Trellis-Based Classification
The joint classification (2) can be approximately solved by a trellis structure as illustrated in Fig. 1. In the first layer of this trellis, the layer 1 input B_0 = U is projected onto the subspaces Q^(1)_{j_1}:

B^(1)_{j_1} = (Q^(1)_{j_1})^H B_0.

Matrices B^(1)_{j_1} are then fed into the second layer of the trellis network, where each is again projected onto the corresponding layer 2 subspaces Q^(2)_{j_2}. However, at this point we do not propagate all of these matrices further down the network, but rather decide in each path for the single matrix that provides the largest partial trellis metric ‖(Q^(2)_{j_2})^H B^(1)_{j_1}‖². In the same way, we proceed through the entire trellis until the last layer R of the network, where we additionally perform a final decision (7) on the active trellis path, i.e., the path with the largest trellis metric ‖B^(R)‖². The classified output is then

Û = Q^(1)_{j*_1} Q^(2)_{j*_2} · · · Q^(R)_{j*_R}, (8)

where we use the short-hand notation j*_i for the backpropagated indices of the active trellis path, i.e., the j*_i are obtained by tracing the decisions back from (7) recursively.
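The trellis search described above can be sketched in a few lines of NumPy. The sketch keeps one survivor path per layer-1 codeword, extends each path greedily in deeper layers, and makes the final path decision on the accumulated metric ‖B_R‖²; dimensions, codebook sizes, and variable names are hypothetical toy choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_basis(d_in, d_out):
    A = rng.standard_normal((d_in, d_out)) + 1j * rng.standard_normal((d_in, d_out))
    return np.linalg.qr(A)[0]

# toy sub-codebooks (R = 2 layers, 4 entries each) and a point to classify
codebooks = [[random_basis(6, 4) for _ in range(4)],
             [random_basis(4, 2) for _ in range(4)]]
U = random_basis(6, 2)

# layer 1: open one path per layer-1 codeword, projecting the input B_0 = U
paths = [([j], Q.conj().T @ U) for j, Q in enumerate(codebooks[0])]

# deeper layers: within each path keep only the best extension (survivor)
for book in codebooks[1:]:
    survivors = []
    for idx, B in paths:
        metrics = [np.linalg.norm(Q.conj().T @ B) ** 2 for Q in book]
        j = int(np.argmax(metrics))
        survivors.append((idx + [j], book[j].conj().T @ B))
    paths = survivors

# final decision: the active path is the one with the largest trellis metric
active_idx, B_R = max(paths, key=lambda p: np.linalg.norm(p[1]) ** 2)

# assemble the classified output from the back-propagated active indices
Uhat = codebooks[0][active_idx[0]]
for r in range(1, len(codebooks)):
    Uhat = Uhat @ codebooks[r][active_idx[r]]
print(active_idx)
```

Note that ‖B_R‖² for the active path equals the classification metric ‖U^H Û‖², so the trellis evaluates the same objective as the joint search, just over far fewer candidates.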
In general, this trellis approach performs slightly worse than a joint optimization; however, our prior studies [10] for m = 1 demonstrate that the loss is not significant. In terms of complexity, the trellis approach provides a gain, as the total number of codebook searches is reduced from the full product codebook size 2^b of the joint search to a number that scales with the sum, rather than the product, of the sub-codebook sizes 2^{b_i}. For example, for b = 32 bits equally distributed amongst R = 8 layers, the number of codebook searches is reduced by a factor of more than 10^7. For a further complexity reduction, it is also possible to prune the trellis in each layer to a few paths with the largest norms, without substantially impairing the performance.

D. Application Scenarios
The considered product classification (2) finds application in two scenarios that are relevant for wireless communications: a) Grassmannian CSI quantization in MIMO communications [21], [22]; b) Non-coherent multi-resolution transmission [10], [11]. We briefly outline this connection below.
1) Grassmannian CSI Quantization: In MIMO wireless communications, CSI at the transmitter (CSIT) about the wireless channel matrix H ∈ C^{N_t×N_r}, where N_t = n and N_r = m < n denote the numbers of transmit and receive antennas, is frequently obtained by limited feedback from the receiver. For many transmit strategies, the relevant CSIT is the subspace span(H) ∈ G(n, m). Representing span(H) by an orthogonal basis U, the CSI feedback is obtained from the following quantization problem:

Û = arg max_{Q ∈ Q^(n)_m} ‖U^H Q‖²,

where Q^(n)_m ⊂ G(n, m) denotes the finite-size quantization codebook. For large-scale MIMO systems, the required codebook size D = |Q^(n)_m| = 2^b to achieve a sufficiently small quantization error can become prohibitively large [23]. A hierarchical product codebook together with the proposed trellis classifier can alleviate this problem.
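A minimal sketch of this quantization step, assuming a toy random codebook and small hypothetical dimensions, selects the feedback index by the subspace metric and reports the resulting normalized chordal distortion:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_basis(n, m):
    A = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
    return np.linalg.qr(A)[0]

n, m, b = 8, 2, 4  # hypothetical small dimensions, b feedback bits
codebook = [random_basis(n, m) for _ in range(2 ** b)]  # toy quantization codebook
U = random_basis(n, m)  # orthonormal basis representing span(H)

metrics = [np.linalg.norm(U.conj().T @ Q) ** 2 for Q in codebook]
feedback_index = int(np.argmax(metrics))  # b bits fed back to the transmitter
distortion = 1.0 - metrics[feedback_index] / m  # normalized chordal distortion
print(feedback_index, round(distortion, 3))
```

Since ‖U^H Q‖² ≤ m with equality only for span(Q) = span(U), the distortion lies in [0, 1] and vanishes for a perfect match.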
2) Non-Coherent Multi-Resolution Transmission: In non-coherent Grassmannian transmission, the information is encoded in matrix U ∈ C^{n×m} taken from a Grassmannian symbol constellation/codebook Q^(n)_m ⊂ G(n, m). U is transmitted during n time instances over the block-fading MIMO channel H ∈ C^{N_t×N_r}, where N_t = m and N_r ≥ m. The corresponding input-output relationship is

Y = √ρ U H + Z,

where Z ∈ C^{n×N_r} denotes the unit-variance channel noise and ρ is the signal-to-noise ratio (SNR). Non-coherent detection of U can be achieved by

Û = arg max_{Q ∈ Q^(n)_m} ‖U_y^H Q‖²,

where U_y ∈ C^{n×m} is an orthogonal basis of Y's left singular vectors corresponding to the m largest singular values.
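The detection chain can be sketched end to end as follows; the constellation, SNR, and dimensions are hypothetical toy values, and at this SNR the transmitted index is typically recovered.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_basis(n, m):
    A = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
    return np.linalg.qr(A)[0]

n, m, Nr, rho = 8, 2, 2, 100.0  # hypothetical parameters (rho = 20 dB)
constellation = [random_basis(n, m) for _ in range(8)]  # toy random constellation
U = constellation[3]  # transmitted symbol

H = (rng.standard_normal((m, Nr)) + 1j * rng.standard_normal((m, Nr))) / np.sqrt(2)
Z = (rng.standard_normal((n, Nr)) + 1j * rng.standard_normal((n, Nr))) / np.sqrt(2)
Y = np.sqrt(rho) * U @ H + Z  # block-fading input-output relation

# U_y: left singular vectors belonging to the m largest singular values of Y
U_y = np.linalg.svd(Y, full_matrices=False)[0][:, :m]
detected = int(np.argmax([np.linalg.norm(U_y.conj().T @ Q) ** 2
                          for Q in constellation]))
print(detected)
```

The SVD-based projection discards the unknown channel realization H, which is what makes the scheme non-coherent.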
In the non-coherent multi-resolution transmission scheme of [10], U is constructed from R streams utilizing the product decomposition (1), with the data of stream i carried by the layer matrix Q^(i). Joint detection of these R streams leads to the same classification problem as in (2) and can therefore be efficiently approximated by the proposed trellis classifier.

E. Classifier Training
The subspaces Q^(i)_{j_i} of the individual layers of the trellis can be trained for different tasks, similar to the weights of a neural network. In this section, we develop unsupervised learning algorithms based on stochastic gradients and back-propagation for our two specific application scenarios of Grassmannian CSI quantization and non-coherent multi-resolution transmission. In both cases, the training will in general only provide a local optimum; thus, multiple training runs with different initial random states of the subspaces Q^(i)_{j_i} can be advantageous.

1) Grassmannian CSI Quantization: In this application, the goal commonly is to find a quantization codebook that minimizes the average normalized chordal distance distortion

E( 1 − ‖U^H Û‖²/m ).

For isotropically distributed subspaces span(U) of the source samples, it is well known that this is achieved by maximally spaced subspace packings [24]. Yet, in general, the source samples may not be isotropically distributed, implying that the quantization sub-codebooks Q^(i) of the trellis should then be adapted to the actual distribution of the source. This can be achieved by training the classification network with approaches similar to deep neural network training, namely stochastic gradients and back-propagation.
Specifically, given a training sample U, we first propagate U through the trellis and determine the corresponding active indices {j*_1, …, j*_R} and classified output Û according to (7) and (8). To train layer r, we update the corresponding codebook entry Q^(r)_{j*_r} such as to increase the quantization metric ‖U^H Û‖² via a stochastic gradient step. With the partial products A_r = ∏_{i=1}^{r−1} Q^(i)_{j*_i} and B_r = ∏_{i=r+1}^{R} Q^(i)_{j*_i}, such that Û = A_r Q^(r)_{j*_r} B_r, the gradient of ‖U^H Û‖² w.r.t. Q^(r)_{j*_r} is

∇_r = A_r^H U U^H Û B_r^H. (14)

Next, we project this gradient onto the tangent space of Q^(r)_{j*_r} to find an ascent direction on the manifold [25]:

∇^(t)_r = ∇_r − Q^(r)_{j*_r} (Q^(r)_{j*_r})^H ∇_r. (15)

Finally, we move along the geodesic defined by the projected gradient to update the codebook entry [26]:

Q^(r)_{j*_r} ← Q^(r)_{j*_r} V_r cos(Σ_r t) V_r^H + U_r sin(Σ_r t) V_r^H.

Here, the matrices U_r, Σ_r and V_r are obtained from a compact singular value decomposition (SVD) ∇^(t)_r = U_r Σ_r V_r^H of the projected gradient. The step-size/learning-rate t can be tuned for convergence speed and accuracy [27]. These stochastic gradient steps are performed recursively for each training sample U, starting at the last layer R and back-propagating to the first layer. Thereby, many calculations, especially for the matrices A_r and B_r, can be reused when moving from one layer to the next to reduce the training complexity.
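The tangent-space projection and geodesic update along the projected gradient (following the Grassmann geodesic formula of [25], [26]) can be sketched as a single reusable step; the dimensions, step-size, and function name are hypothetical, and the test gradient is random rather than the quantization gradient of a real trellis.

```python
import numpy as np

def geodesic_step(Q, G, t):
    """Move the orthonormal basis Q along the Grassmann geodesic defined by
    the tangent-space projection of the ambient gradient G, with step-size t."""
    Gt = G - Q @ (Q.conj().T @ G)  # project G onto the tangent space at Q
    Ur, s, Vh = np.linalg.svd(Gt, full_matrices=False)  # compact SVD of Gt
    # geodesic update: Q <- Q V cos(S t) V^H + Ur sin(S t) V^H
    return Q @ Vh.conj().T @ np.diag(np.cos(s * t)) @ Vh \
        + Ur @ np.diag(np.sin(s * t)) @ Vh

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3)) + 1j * rng.standard_normal((6, 3))
Q = np.linalg.qr(A)[0]  # random point on the manifold
G = rng.standard_normal((6, 3)) + 1j * rng.standard_normal((6, 3))  # toy gradient
Qnew = geodesic_step(Q, G, 0.05)
print(np.linalg.norm(Qnew.conj().T @ Qnew - np.eye(3)))  # stays orthonormal
```

The update leaves the codebook entry exactly on the manifold for any t, which is the reason for moving along geodesics instead of taking a Euclidean step followed by re-orthogonalization.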
Finally, these learning steps are repeated over a large training set of independent samples U. To achieve convergence, the step-size t is reduced throughout the learning process.
2) Non-Coherent Multi-Resolution Transmission: For this application, it is known that the maximum likelihood detection performance is optimized if the minimum pairwise distance of the effective product symbol constellation Q^(n)_m is maximized [11]. We therefore propose to train the classifier to improve the pairwise minimum distance of constellation points as described below.
First, in each training iteration, we randomly sample a constellation point U from the current product symbol constellation Q^(n)_m. We then propagate U through the trellis and determine the indices not only of the best trellis path, corresponding to Û = U according to (7), (8), but also of the second-best path, corresponding to Û_2nd.
To train layer r, we increase the distance between Û and its nearest trellis neighbor Û_2nd via a stochastic gradient step.
We therefore calculate the gradient ∇^2nd_r, similar to (14), and project it onto the tangent space of Q^(r)_{j^2nd_r}, similar to (15). As we intend to increase the pairwise distance, we now have to move along the geodesic defined by the negative projected gradient −∇^{2nd,(t)}_r. At the same time, however, we have to be careful not to move too close to other constellation points and thereby reduce the distance to them too much. This optimization therefore requires a smart selection of the step-size t.
Specifically, we perform a line-search over t, where we increase t as long as the minimum distance w.r.t. the neighboring constellation points of Û_2nd is improved. Neighboring constellation points in layer r are defined as those that differ from Û_2nd only in the corresponding layer r codebook entry. Details are described in Algorithm 1.

III. SIMULATIONS

A. Grassmannian CSI Quantization
We consider quantization of orthogonal bases U ∈ C^{64×4} corresponding to points on G(64, 4). These bases are generated from spatially correlated Rayleigh fading MIMO channels H ∈ C^{64×4} assuming a Kronecker correlation model [28],

H = C^{1/2} H̃,

where H̃ ∼ CN(0, 1) is an independent and identically distributed (i.i.d.) Rayleigh fading matrix and C is the correlation matrix, parametrized as diag(C, d) = α^{|d|} 1_{n−|d|}. We train the trellis on a training data set consisting of 3·10^4 samples and we evaluate the performance on a test data set of 10^3 samples.

Algorithm 1 Step-Size Line-Search
1: Initialize pairwise distance of active trellis pair
2: Initialize sub-constellation distances of layer r
…

1) Convergence: In Fig. 2, we exhibit the convergence behavior of the learning process for correlation coefficients α ∈ {0.6, 0.7, 0.8, 0.9}. We consider a trellis with R = 20 layers and sub-codebook sizes D_i = 2^{b_i} = 8, ∀i; hence, the total number of quantization bits is b = Σ_i b_i = 60. During training, we gradually reduce the gradient step-size t from initially 0.1 down to 0.01. As we can see in Fig. 2, with lower correlation α learning only sets in later, once the step-size is sufficiently small. Furthermore, we observe that the steady-state convergence value is the same as the average distortion achieved on the test data.
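Generating training samples for this setup can be sketched as follows; the Cholesky factor is used as one valid square root of C (the paper's exact factorization may differ), and α = 0.8 is one of the simulated correlation coefficients.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, alpha = 64, 4, 0.8

# Toeplitz correlation matrix with diag(C, d) = alpha^{|d|} 1_{n-|d|}
idx = np.arange(n)
C = alpha ** np.abs(idx[:, None] - idx[None, :])

# one valid square root of C (assumption: Cholesky factor in place of C^{1/2})
L = np.linalg.cholesky(C)
H_iid = (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))) / np.sqrt(2)
H = L @ H_iid  # spatially correlated channel sample

U = np.linalg.qr(H)[0]  # training sample: orthonormal basis of span(H)
print(U.shape)
```

Since span(H) is invariant to the choice of square root of C up to the induced distribution, any positive-definite factorization of C yields statistically equivalent training samples.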
2) Benchmark Comparison: We benchmark our scheme against k-means vector quantization as provided by MATLAB, since we are not aware of another applicable scheme that can handle large-dimensional correlated data. Applying k-means clustering directly to the 64 × 4-dimensional training samples is computationally not feasible for larger numbers of bits. To reduce complexity, we therefore partition U into non-overlapping vectors and apply k-means clustering on each vector individually. Specifically, we partition U into {4, 8, 16} vectors of length {64, 32, 16} and utilize a total of b ∈ {52, 104, 160} bits to quantize the vectors. For reconstruction, we concatenate the quantized vectors and apply an SVD to obtain an orthogonal basis Û (post-orthogonalization).
For trellis-based quantization, we consider b ∈ {16, 64, 128} bits, R = 60 layers and a dimension step-size of Δ_i = 1, ∀i. A small Δ_i provides the lowest quantization complexity, because the total budget of b bits is divided amongst many layers. Equal partitioning of the b bits amongst all layers does not achieve the lowest possible distortion. Thus, a hyper-parameter optimization over the number of bits b_i per layer is generally required. As a guideline for this hyper-parameter search we can utilize a result of [21]: the recursive multi-layer quantizer of [21] achieves the lowest distortion when the bits are partitioned such that each layer contributes equal distortion. Although we have not yet been able to generalize this result to trellis-based quantization, it still provides a good starting point for a hyper-parameter search. For the relatively small number of bits considered in our simulation, compared to the dimensions of the quantization problem, it turns out that the bits should mostly be assigned to the latter layers of the trellis, whereas the preceding layers are assigned only single-element sub-codebooks, which are trained to fit the mean value of the training samples U. In our simulation, this leads to assigning bits only to the last 3, 10 and 18 layers for b ∈ {16, 64, 128}, respectively.
In Fig. 3, we show the results of our benchmark comparison. As can be seen, the number of bits used for k-means has been selected so as to match the performance of trellis-based classification. Especially for smaller numbers of bits, the proposed trellis quantizer is much more efficient than k-means with post-orthogonalization.

B. Non-Coherent Multi-Resolution Transmission
For non-coherent data transmissions, the minimum distance properties of the symbol constellation govern the achievable performance. We thus investigate the average pairwise minimum distance of our trained constellations, as a function of the number b of bits encoded in the constellation. As a benchmark, we utilize another numerically optimized symbol constellation, namely the direct constellation design of [29], in which the points of the symbol constellation are jointly optimized by applying a smooth approximation to the maximization of the minimum distance. This optimization can be solved by stochastic gradient techniques, similar to our approach for the number of layers R = 1, with the difference that the gradient in [29] is jointly calculated from all constellation points, whereas we consider in each step only a single pair of constellation points, as described in Section II-E, to reduce the complexity.

Fig. 4. Average pairwise minimum distance of our trellis-based symbol constellations compared to the direct design of [29]. The constellations are designed for transmission over n = 8 time instances and m = 2 antennas, with equal bit-partitioning amongst the R multi-resolution streams.
The results are shown in Fig. 4 for different numbers R of non-coherent multi-resolution streams. For R = 1, our approach achieves the same performance as the direct optimization approach. However, the training complexity is lower, as it is based only on a single pair of constellation points, and therefore larger constellations can be trained. By increasing the number R of layers, the large product constellation is split into smaller sub-constellations per layer, which allows training even larger symbol constellations, yet with slightly worse distances.

IV. CONCLUSION
We have presented a trellis-based hierarchical subspace classification network, which can be trained to perform different tasks. Specifically, we have applied the classifier in the contexts of Grassmannian CSI quantization and non-coherent multi-resolution transmissions. The main advantage of the proposed classifier is that it provides substantially reduced complexity compared to a single-layer classifier and can therefore support classification on higher-dimensional Grassmann manifolds.