MetNet: A Novel Low-Complexity Neural Network-Aided Detection for Faster-Than-Nyquist (FTN) Signaling in ISI Channels

This paper studies the application of neural networks to Viterbi detection of FTN signals in an intersymbol interference (ISI) channel. The main contribution of this paper is to propose a receiver structure for detecting FTN signals in an unknown static ISI channel. In particular, we propose a novel low-complexity neural network structure for calculating the branch metrics, and we explore its suitability for FTN signalling with channel uncertainty. We compare the proposed network, which we call the Metric Net (MetNet), to a benchmark neural network-based technique for metric calculation, the ViterbiNet, which was originally designed for ISI channels. The simulation results confirm that the MetNet outperforms the ViterbiNet with two orders of magnitude lower complexity, and is much more resilient to channel uncertainty than the traditional Viterbi detector, which uses the Euclidean distance for metric calculation. We further show that the MetNet exhibits robustness to being trained at mismatched SNR values and FTN acceleration factors, meaning that the number of trained models required can be significantly reduced. Additionally, the results show that the proposed MetNet remains a favorable alternative at much higher levels of channel uncertainty. The results also reflect that we can generalize the MetNet to work with different channel models defined by different decaying factors. Finally, we show that we achieve a bandwidth efficiency gain of 33% due to FTN by using the MetNet in the presence of channel uncertainty.


Contemporary communication systems leverage Nyquist signalling techniques to avoid intersymbol interference (ISI); however, with recent advances in silicon technology, some ISI can be handled at the receiver in order to achieve higher rates. Faster-than-Nyquist (FTN) signalling is a promising transmission technique which improves spectral efficiency within the same operating bandwidth by accelerating signal transmission by a factor of 1/τ, where 0 < τ ≤ 1. This results in self-induced pulse-shaping ISI that can be handled at the receiver to a certain extent. The authors in [1] show that, using uncoded root-raised cosine transmit pulses with roll-off β = 0.3, τ can go down to 0.703 without bit error rate (BER) performance loss; the minimum value of τ at which we can operate without observing any performance degradation asymptotically at high SNR is known as the Mazo limit. This result can be achieved when using optimal detection based on maximum likelihood sequence estimation [2], or in other words, a Viterbi detector, which is inherently optimized to work with white noise.
In this paper, we explore the performance of FTN in an ISI channel, where an additional, unknown source of ISI underlies the channel on top of the pulse-shaping ISI of FTN; this results in longer and more severe ISI. At the receiver, we compare the performance of a regular Viterbi detector against neural network-aided metric calculators for Viterbi detection, in the presence of channel state information uncertainty.
Different FTN receivers were explored in [3], [4]. The authors in [3] established a low-complexity, sub-optimal FTN detector based on convex relaxation, a primal-dual predictor-corrector interior point method, and quantization. The authors in [4] exploit a mathematical programming technique based on the alternating direction method of multipliers to design an FTN detector for ultra-high modulation orders up to 64K with notable spectral efficiency gains. Channel estimation issues for FTN signals in ISI channels where the channel state information (CSI) is unknown were investigated in [5], [6] using pilot symbols. The performance of FTN in different multi-path channels was explored in [7], [8]. Furthermore, machine learning (ML) based receivers for FTN systems were designed in [9], [10].
Other ML applications outside FTN for ISI channels without perfect CSI knowledge were explored in [11], which introduced the sliding bidirectional recurrent neural network (SBRNN), a neural network-aided receiver showing resiliency to CSI uncertainty over its model-based counterparts. This idea gave rise to other neural network (NN)-aided receivers such as the BCJRNet [12], based on the BCJR algorithm [13], and the ViterbiNet [14], against which we compare our newly proposed neural network architecture, the Metric Net (MetNet).
The main contribution of this paper is to propose a receiver structure for detecting FTN signals in an unknown static ISI channel. In particular, we propose a novel low-complexity neural network structure for metric computation in the Viterbi detector, which does not require accurate channel state information during training or deployment. We present the performance of the MetNet in an FTN system under different levels of CSI uncertainty. We compare the MetNet to the ViterbiNet in terms of computational complexity and performance, as well as to the traditional method of computing metrics for Viterbi detection using the Euclidean distance (ED) [15]. We also explore the MetNet's performance when trained with no noise at all, while tuning the number of channels used during training as well as the number of epochs. Moreover, we explore the network's ability to perform over different ISI channels governed by different decaying factors, γ, when it is trained using a dataset generated from channels spanning a range of γ-values. Further, we test the resiliency of the MetNet to being trained at τ values different from the actual one. Additionally, we present the performance of the MetNet at very high levels of channel uncertainty, and finally, we report on the bandwidth efficiency gains when using the MetNet in the FTN system.
Most neural network-based receivers designed for general ISI aim to replace model-based receivers and achieve the desired results through proper training and tuning, but for the most part the network-based receiver is treated as a black box [11], [16], whereas the MetNet, similar to the ViterbiNet, embeds a neural network within a known model-based detector. Moreover, some NN-based receivers for ISI channels leverage recurrent neural networks or other architectures with memory to learn the underlying correlations in ISI channels; the MetNet does not, in order to keep the complexity low while still achieving superior BER performance.

II. RELEVANT LITERATURE OVERVIEW
The work in this paper is motivated by previous works that investigate deep learning techniques used to train detection algorithms on simulated samples of received signals, without accurate knowledge of the underlying channel coefficients. The work in [11] proposes a technique based on recurrent neural networks (RNNs) trained on a diverse dataset containing received samples from different realizations of channel conditions, which results in a detector, called the sliding bidirectional RNN (SBRNN), that is robust to channel uncertainty. The SBRNN was shown to approach the BER performance of an optimal Viterbi detector when the channel conditions are perfectly known, and to outperform the Viterbi detector under CSI uncertainty.
Following the SBRNN, another NN-based detector was developed, called the ViterbiNet, which achieved superior results to the SBRNN [14]. The ViterbiNet, just like the MetNet in this work, acts as a metric calculator which feeds likelihood values into a Viterbi detector, where conventionally the Viterbi detector would use the Euclidean distance to compute these metrics instead. Similar to the SBRNN, the ViterbiNet outperforms traditional metric calculation in Viterbi detectors in the presence of CSI uncertainty. Similar results are demonstrated by the BCJRNet [12], a network that learns a factor graph representing the channel and carries out symbol recovery using the sum-product algorithm.
Building on top of the ViterbiNet, the Meta-ViterbiNet was proposed in [20] to keep the network viable under more diverse and dynamic channel conditions through rapid online retraining. This approach shows advantages over static training on a dataset generated from different channel condition realizations for two main reasons: first, the dataset needs to be large enough to represent all the different conditions, and second, the statically trained network struggles when the channel conditions deviate greatly from the training conditions. Online training represents valuable future work for our MetNet, especially given its low-complexity nature, which makes it suitable for frequent retraining, but it is outside the scope of this work.
Other interesting deep learning-based detectors for FTN signals include the work in [9], which introduces a joint deep learning-based detector followed by a successive interference cancellation block that calculates and subtracts the interference from received signals to obtain more accurate log-likelihood ratios.

III. SYSTEM MODEL
The system implemented is an FTN-based wireless system over an ISI channel with a Viterbi detector, as shown by the block diagram in Fig. 1. With FTN signalling, symbols are transmitted at a higher rate, but the pulse width is not changed. That is, the symbol period is reduced to τT, where 0 < τ < 1 is known as the accelerating (or squeezing) parameter, so the transmitted signal becomes

s(t) = Σ_n v_n h_T(t − nτT),   (1)

where {v_n} are the transmitted symbols, T is the symbol period, and h_T(t) is the root raised cosine pulse shape with roll-off factor β.
We consider an L_c-tap ISI channel, with impulse response

c(t) = Σ_{l=0}^{L_c−1} α_l δ(t − lτT),   (2)

where α_l is the gain of the l-th tap. We also explore different levels of channel state information (CSI) at the receiver. The combination of the root-raised cosine pulse shape, the L_c-tap ISI channel with channel gains α_l for l ∈ {0, 1, . . . , L_c − 1}, additive white Gaussian noise (AWGN) with power spectral density (PSD) N_0, the matched filter, and the noise whitening filter can be modeled as the equivalent discrete-time channel

r_n = Σ_{l=0}^{L−1} h_l v_{n−l} + w_n,   (3)

where v_n are the transmitted symbols, h_n are the effective channel taps, L is the effective channel length including the ISI from both the channel and the FTN pulses, and w_n is the noise, modeled by i.i.d. Gaussian random variables with zero mean and variance N_0/2. The effective channel taps are given by

h_n = Σ_{l=0}^{L_c−1} α_l f_{RC,n−l},   (4)

where α_l are the channel gains of the underlying ISI channel and f_{RC,n} are the coefficients of the spectral factorization of the samples of a raised cosine pulse, h_{RC,n} = h_RC(nτT).
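As an illustration, the discrete-time model in (3)–(4) can be simulated in a few lines of NumPy. The f_RC coefficients below are placeholders and the helper names are ours; this is a sketch of the signal model, not the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_taps(alpha, f_rc):
    """Effective taps h_n in (4): channel gains alpha_l convolved with the
    factorized raised-cosine coefficients f_RC,n."""
    return np.convolve(alpha, f_rc)

def simulate_rx(v, h, n0, rng):
    """Received samples r_n = sum_l h_l v_{n-l} + w_n from (3),
    with w_n ~ N(0, N0/2)."""
    r = np.convolve(v, h)[: len(v)]
    return r + np.sqrt(n0 / 2) * rng.standard_normal(len(v))

gamma = 1.0
alpha = np.exp(-gamma * np.arange(4))    # 4-tap ISI channel gains (example)
f_rc = np.array([1.0, 0.3, 0.1])         # placeholder f_RC coefficients
h = effective_taps(alpha, f_rc)          # effective channel taps h_n
v = 2.0 * rng.integers(0, 2, 1000) - 1   # BPSK symbols in {-1, +1}
r = simulate_rx(v, h, n0=0.1, rng=rng)   # simulated received samples
```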

A. METRIC COMPUTATION FOR VITERBI DETECTION
The Viterbi algorithm can be viewed as a two-step process: the first step involves calculating the metrics, and the second uses those metrics to find the most likely sequence of transmitted symbols. The branch metrics, μ_n(r_n | v_n, v_{n−1}, . . . , v_{n−L+1}), for the Viterbi algorithm at time n are the negative natural log of the likelihood functions of the current received sample, r_n, given a hypothetical transmitted symbol, v_n (corresponding to a branch between two states), and the previous L − 1 symbols, v_{n−1}, . . . , v_{n−L+1} (corresponding to the state). For the purposes of the VA with AWGN, this is equivalent to the more commonly used Euclidean distance (ED),

μ_n(r_n | v_n, . . . , v_{n−L+1}) = |r_n − Σ_{l=0}^{L−1} h_l v_{n−l}|².   (5)

This means that typically, in order to compute the metric, we require knowledge of the channel taps, h_l. This could be prohibitive in complex or dynamic channels, which typically require frequent transmission of reference signals, reducing the spectral efficiency.
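A minimal sketch of the ED computation in (5) for BPSK follows; the branch tuples are ordered (v_n, v_{n−1}, . . . , v_{n−L+1}), and the function name is our own.

```python
import itertools

def ed_branch_metrics(r_n, h):
    """Branch metrics (5): |r_n - sum_l h_l v_{n-l}|^2 for every branch."""
    L = len(h)
    metrics = {}
    for branch in itertools.product([-1, +1], repeat=L):
        # branch[l] = v_{n-l}, so h[l] multiplies branch[l]
        y = sum(h[l] * branch[l] for l in range(L))
        metrics[branch] = abs(r_n - y) ** 2
    return metrics   # M^L = 2^L entries for BPSK
```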
Alternatively, instead of computing these metrics using the Euclidean distance, we can train a neural network to learn them using an offline supervised learning approach, with simulated received signals as the features and transmitted signals as the true labels. The MetNet, similar to the ViterbiNet, provides an alternative way of computing the metrics, with the benefit of not requiring perfect knowledge of the channel when doing so. We will expand on the architectures of both the ViterbiNet and the MetNet in the following.

IV. PROPOSED NEURAL NETWORK (METNET)
The motivation behind the MetNet is to train a network that learns to estimate the metrics of the trellis, which are then used for Viterbi detection. The aim is also to establish a significantly lower-complexity alternative to the ViterbiNet that achieves better BER performance under different levels of channel uncertainty and decaying factors in an FTN system.
The architecture of the MetNet, shown in Fig. 2, is a low-complexity one, consisting of just a 1 × M^L fully connected layer (FCL), followed by a custom activation layer, a softmax layer, and a classification output layer. The custom activation layer applies the function

f(x) = −|x|²,

which ensures that the output of the softmax layer that follows it closely follows a Gaussian probability density function (PDF). The output of an N_in × N_out FCL is a vector y = Wx + b, where W is the N_out × N_in weight matrix and b is the N_out × 1 bias vector. The weights and biases are optimized during training. The softmax layer applies the activation function

softmax(y)_k = e^{y_k} / Σ_{j=1}^{M^L} e^{y_j},

and the output at the softmax layer is an estimate of the a posteriori probabilities Pr{v_n, v_{n−1}, . . . , v_{n−L+1} | r_n}. The final classification output layer calculates the cross-entropy loss, which is minimized during training using the Adam optimizer [19]. Prior to deployment, the neural network is trained on a sufficient number of simulated received samples corresponding to known transmitted symbols, with only imprecise knowledge of the potential state of the channel during deployment. During deployment, the trained neural network is sequentially fed each received sample, r_n, to generate a posteriori probabilities of all M^L combinations of the current and previous L − 1 symbols, which are converted to branch metrics and fed into the Viterbi algorithm. Training of the several models is done only once, offline.
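To make the architecture concrete, the following is a minimal PyTorch sketch of the layer stack described above. The framework choice, the class name, and the BPSK/L = 5 settings are our assumptions for illustration; the paper does not publish reference code. Note that nn.CrossEntropyLoss applies log-softmax and the cross-entropy jointly, playing the role of the softmax and classification output layers during training.

```python
import torch
import torch.nn as nn

class MetNet(nn.Module):                 # illustrative name
    def __init__(self, M: int, L: int):
        super().__init__()
        self.fc = nn.Linear(1, M ** L)   # the single 1 x M^L FCL

    def forward(self, r):                # r: (batch, 1) received samples
        y = self.fc(r)
        return -y.abs() ** 2             # custom activation f(x) = -|x|^2

model = MetNet(M=2, L=5)                 # BPSK, M^L = 32 branches
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()        # softmax + cross-entropy loss

# At deployment, softmax of the outputs estimates the a posteriori
# probabilities Pr{v_n, ..., v_{n-L+1} | r_n} for all M^L branches:
probs = torch.softmax(model(torch.randn(8, 1)), dim=-1)
```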
Whether a neural network is used for metric computation or not, the next step within the Viterbi algorithm remains the same: the calculated metrics are used to find the shortest (survivor) path, which yields an estimate of the transmitted symbol sequence.
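For concreteness, a compact sketch of this search is given below; it assumes BPSK and branch metrics supplied per time step as dictionaries keyed by the length-L branch tuples (v_n, . . . , v_{n−L+1}), matching the earlier ED sketch. The names and the zero-initialized path metrics are our own illustrative choices.

```python
import itertools

def viterbi(metrics, L, symbols=(-1, +1)):
    """metrics: list over time n of dicts mapping length-L branch tuples
    (v_n, v_{n-1}, ..., v_{n-L+1}) to branch-metric values."""
    states = list(itertools.product(symbols, repeat=L - 1))
    cost = {s: 0.0 for s in states}        # accumulated path metrics
    back = []                              # backpointers per time step
    for mu in metrics:
        new_cost, step = {}, {}
        for s in states:
            for v in symbols:
                branch = (v,) + s          # extend state s by new symbol v
                ns = branch[:-1]           # next state drops oldest symbol
                c = cost[s] + mu[branch]
                if ns not in new_cost or c < new_cost[ns]:
                    new_cost[ns], step[ns] = c, s
        cost = new_cost
        back.append(step)
    s = min(cost, key=cost.get)            # best (survivor) final state
    decisions = []
    for step in reversed(back):            # trace the survivor path back
        decisions.append(s[0])             # s[0] is the newest symbol v_n
        s = step[s]
    return decisions[::-1]

# e.g., decisions = viterbi([ed_branch_metrics(rn, h) for rn in r], L=len(h))
```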
In the following, we perform a complexity analysis of our MetNet against the ViterbiNet.

A. COMPLEXITY ANALYSIS AGAINST VITERBINET
The authors in [14] presented a deep neural network (DNN), called the ViterbiNet (VN). The architecture of the VN consists of a 1 × 100 FCL, followed by a sigmoid activation layer, a 100 × 50 FCL, a rectified linear unit (ReLU) activation layer, a 50 × M^L FCL (where M is the modulation order and L is the channel length), and a softmax activation layer. The sigmoid activation function is applied to the input so that the output is bounded to the interval (0, 1), using the function

σ(x) = 1 / (1 + e^{−x}).

The ReLU is simply a threshold operation that ensures the output is non-negative by setting any negative input value to zero. Additionally, the authors incorporate a Gaussian mixture model, which is used to estimate the marginal distribution of the channel output samples, f(r_n), which in turn is used to convert the a posteriori probabilities to likelihood function values according to Bayes' rule.
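The stack is straightforward to write down; below is an illustrative PyTorch rendering of the layers as described (the framework and variable names are our assumptions, and the Gaussian mixture post-processing is omitted):

```python
import torch.nn as nn

ML = 32  # M^L output classes (M = 2, L = 5 in our simulations)
viterbinet = nn.Sequential(
    nn.Linear(1, 100), nn.Sigmoid(),        # 1 x 100 FCL + sigmoid
    nn.Linear(100, 50), nn.ReLU(),          # 100 x 50 FCL + ReLU
    nn.Linear(50, ML), nn.Softmax(dim=-1),  # 50 x M^L FCL + softmax
)
```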
We compare the complexity of the MetNet with the VN during the deployment phase (after the network has been trained). Our analysis assumes that multiplications, additions, subtractions, and divisions each count as a single floating point operation (flop), and that exponentials and logarithms are implemented using 5th-order rational polynomials, which translates to around 20 flops each.
The VN's first fully connected layer (FCL) of size 1 × 100 multiplies by a weight and adds a bias at each of the 100 nodes, which translates to 200 flops. The sigmoid activation performs an addition, a division, and an exponential at each node, translating to 2200 flops. The second 100 × 50 FCL requires 100 × 50 × 2 = 10,000 flops, while the intermediate ReLU activation is just a comparison with zero, so no flops are counted. The last 50 × M^L FCL requires 50 × M^L × 2 = 100M^L flops.
These layers are followed by the softmax activation layer, as well as additional processing to convert the a posteriori probabilities to likelihood functions and a negative log to produce the branch metrics. Although these three steps were included in the system described in [14], they serve no purpose after the network has been trained: the output of the last FCL could be fed directly into the VA with no difference in system performance, because these steps only convert between the log and linear domains and back, and add constants common to all branch metrics at a given time, which do not affect the VA's decisions. We therefore exclude these three steps from our analysis. The total number of operations of the VN is therefore 14200 + 100M^L flops per received sample.
Similarly, we perform the same analysis for the MetNet. The FCL we use requires M^L × 2 flops, and our custom activation layer adds an extra M^L flops. In total, the MetNet requires 3M^L flops per received sample.
For example, with M^L = 32, which is what we use in the simulations, the VN needs 17400 flops to calculate the branch metrics for each received sample, while the MetNet needs only 96 flops: two orders of magnitude fewer. For reference, the Euclidean distance (ED), which is just a subtraction and a square per branch, requires 2M^L flops, as shown in Table 1. The MetNet is therefore almost as low in complexity as the traditional ED used in the classical VA, yet achieves much better performance in detecting an FTN sequence under CSI uncertainty, as will be shown in the next section.
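These counts are easy to reproduce; the snippet below simply encodes the closed forms stated above and in Table 1.

```python
ML = 32                                  # M^L used in the simulations
vn_flops = 14200 + 100 * ML              # ViterbiNet total per sample
metnet_flops = 3 * ML                    # FCL (2M^L) + activation (M^L)
ed_flops = 2 * ML                        # subtraction + square per branch
print(vn_flops, metnet_flops, ed_flops)  # 17400 96 64
```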

V. SIMULATION AND RESULTS
To investigate the performance of the MetNet, a Monte Carlo simulation was performed using BPSK modulation and a 4-tap ISI channel defined by

α_l = e^{−γl},

where l ∈ [0, 1, . . . , L_c − 1] and L_c = 4. A decaying factor of γ = 1 was used for most of the results, and the squeezing parameter investigated in this study is mostly τ = 0.8. This technique would also work under the same ISI conditions without FTN, but as mentioned earlier, the scope of this work is FTN. Moreover, due to the low-complexity nature of the network, it fails to perform well under severe dynamic fading, such as Rayleigh fading. We compare the performance of a detector with known channel coefficients against one with corrupted channel coefficients, h̃, such that

h̃_l = h_l + e_l,

where e_l denotes the corruption, simulated using a zero-mean Gaussian random variable with variance σ² (a minimal simulation sketch of this corruption model is given at the end of this section). During corrupted training, the neural network is trained offline on simulated received symbols generated from different instances of corrupted channel taps. In other words, the received samples, r_n, in (3) are simulated using h̃_l instead of h_l. On the other hand, the traditional ED metric calculator uses different instances of corrupted channel taps to compute the metrics, so the metrics, μ_n, calculated based on (5) are computed using h̃_l instead of h_l. Fig. 3 shows the results established in [21] of the MetNet against the ViterbiNet (VN) and the Euclidean distance (ED) methods of metric computation at τ = 0.8, σ² = 0.1, and γ = 1. This plot shows that when the receiver has noisy channel estimates, h̃, both of the neural network-based architectures perform significantly better than the conventional Euclidean distance method, which suffers from an error floor. As a benchmark, Fig. 3 also shows the BER for the ideal case when the ED-based metrics are used with perfect CSI; both NN-based approaches are nearly as good despite having imperfect CSI. Further, the results show that the MetNet produces better results than the VN despite having significantly lower complexity. For example, the bit error rate (BER) of the ViterbiNet at 8 dB is 1.9 × 10⁻⁴, while the MetNet has a BER of 7.5 × 10⁻⁵, which is very close to the optimal result using the Euclidean distance with perfect CSI to calculate the metrics, which has a BER of 5.5 × 10⁻⁵ at 8 dB. Meanwhile, the ED metric with channel uncertainty showed the worst results, with a BER of 2 × 10⁻² at 8 dB, which shows that it is not robust to channel uncertainty.
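Below is the promised sketch of the corruption model: each training realization draws fresh perturbed taps h̃_l = h_l + e_l with e_l ~ N(0, σ²). The function name and the example taps are our own illustration.

```python
import numpy as np

def corrupt_taps(h, sigma2, rng):
    """Return h~_l = h_l + e_l, with e_l ~ N(0, sigma^2)."""
    return h + np.sqrt(sigma2) * rng.standard_normal(len(h))

rng = np.random.default_rng(1)
h = np.exp(-1.0 * np.arange(4))                  # example taps, gamma = 1
h_tilde = corrupt_taps(h, sigma2=0.1, rng=rng)   # one corrupted realization
```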

A. METRIC COMPUTATION COMPARISON
There was basic tuning done in [21] to generate Fig. 3; the next subsection establishes a more formal and thorough tuning of key hyperparameters.

B. HYPERPARAMETER TUNING
Two of the most important hyperparameters to tune in order to avoid an overfitted or underfitted model are the training size and the number of epochs, N_eps. Since we established earlier that training benefits from having blocks of size M^L, each passed through a different channel realization, the training size is essentially determined by the number of different channels, N_chs, that we train on, i.e., training size = M^L × N_chs.
Hence, we tune both parameters by comparing models generated from different values of each, with N_chs ∈ {10, 20, . . . , 300} and N_eps ∈ {10, 20, . . . , 300}. Fig. 4 shows a 3D plot of the BER for each combination of N_chs and N_eps, for models trained at τ = 0.8, γ = 1, and σ² = 0.1. From this plot, we can identify a trough that is lower than the rest, and we can also see that at some point, increasing the number of epochs or channels yields diminishing returns. We can break this 3D plot down into two 2D plots showing the effects of N_chs and N_eps separately. Fig. 5(a) shows the BER versus the number of epochs, where each curve corresponds to a different number of channel realizations trained on; similarly, Fig. 5(b) shows the BER versus the number of channel realizations trained on, where each curve corresponds to a different number of epochs. Fig. 5(a) shows that for almost any sufficiently high number of epochs, training on 100 different channel realizations yields the minimum BER; similarly, Fig. 5(b) achieves the minimum BER at 50 epochs. We note that the difference between the best combination (minimum BER) and other combinations of these parameters is not very significant; however, at 100 different channels with 50 epochs we achieve the best results at lower complexity than the alternatives. While one might assume that training on more channel realizations should result in better training, it could be that beyond 100 channels the model starts saturating and perhaps overfits slightly, which would explain the slight performance dip as the number of channels trained on grows. The final parameters used are summarized in Table 2.

C. TRAINING WITHOUT NOISE
The results in Fig. 3 are generated from a system that is trained offline with a separate model at each SNR. Simulation results show that the MetNet exhibits SNR-resiliency that can be extrapolated to very high SNR values, as shown in Fig. 6: training at a very high SNR, such as 100, incurs no loss in performance, and at such a high SNR this is essentially equivalent to training with no noise. This approach works because of the presence of error variance, and would not work for very low or zero error variance, as will be shown shortly.
We configure the training bits to be a vector of just M^L elements, corresponding to all the possible transmitted sequence bits; this block is repeated N times, where N is the number of different channel realizations (taps) that we train on. With the previous approach, each block was larger than M^L and its bits were randomly generated, without ensuring that each block represented all M^L possible sequences.
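One way to realize this construction is sketched below, under our assumed settings of BPSK (M = 2) and L = 5, so that M^L = 32; the enumeration and the variable names are illustrative.

```python
import itertools
import numpy as np

M, L, n_channels = 2, 5, 100             # BPSK, M^L = 32, N = 100
# One block in which every possible length-L transmitted sequence appears.
block = np.array(list(itertools.product([-1, +1], repeat=L)))  # shape (32, 5)
# The block is repeated N times; each repetition is passed through a fresh
# corrupted channel realization to generate its received training samples.
train_labels = np.tile(block, (n_channels, 1))                 # shape (3200, 5)
```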
We investigate different training approaches, including training with and without noise, as well as training with randomly generated bit blocks and with blocks of M^L bits corresponding to all possible transmitted sequences. Fig. 7 shows the results of each approach; we found that the best approach is training with no noise on training blocks of size M^L, which achieves a BER of 5.5 × 10⁻⁴, very close to the optimal result.

D. PERFORMANCE AT HIGH ERROR VARIANCE
As stated earlier, training without noise works because of the presence of sufficient error variance, which adds the randomness needed for the training dataset to generalize well even without AWGN; this is illustrated in this section.
The MetNet remains a better alternative for metric calculation than both the Euclidean distance and the ViterbiNet, even as the error variance grows more severe, as shown in Fig. 8, which plots the BER performance of the MetNet, the ViterbiNet, and the Euclidean distance over a range of error variances. The graph demonstrates that the MetNet exhibits only a very slow performance degradation as the error variance increases, whereas the ViterbiNet degrades much more rapidly. These results are for τ = 0.8; however, the trend is the same regardless of τ.
Another important observation is the poor performance of the MetNet at very low error variance. This is due to the fact that, when trained without noise and with low error variance, there is very little variance in the training set; this results in overfitting and degrades the BER performance on the test set. In fact, having sufficient error variance in the first place is what allows the system to perform well with no noise during training.
The next section dives deeper into the robustness of our NN, testing its resilience to being trained and tested at mismatched accelerating parameters τ and decaying factors γ.

E. ROBUSTNESS OF METNET
In addition to the SNR invariance demonstrated by the MetNet, as well as its ability to perform well at high error variance, the network also exhibits some τ-resiliency. For example, a network trained at τ = 1 still performs relatively well at τ = 0.9, but that does not remain true at τ = 0.8, as shown in Table 3. These results are generated at 8 dB with CSI uncertainty σ² = 0.1.
Moreover, the MetNet shows some resiliency to being trained at mismatched decaying factors, γ. Fig. 9 demonstrates these results for a network trained at γ = 1, τ = 0.8, σ² = 0.1, and SNR = 8 dB. The results suggest that when training at a lower decaying factor (more severe ISI), the MetNet can still perform relatively well at higher decaying factors (less severe ISI); in this particular example, for a network trained at γ = 1, the results at γ = 1.2 remain within a close BER. However, when testing at lower decay values than the one trained at, the BER drop-off is more severe and noticeable: as we can see from Fig. 9, the result at γ = 0.8 shows a much steeper BER drop than at γ = 1.2. It is therefore also important to observe the results of a network tested at γ = 1 while being trained at mismatched γ-values. Fig. 10 shows the results of networks trained at different γ-values, with τ = 0.8, σ² = 0.1, and SNR = 8 dB, and tested at γ = 1; we notice the same trend as in Fig. 9.
Building a network with even stronger γ-resiliency is desirable, as that means the network can work well in more varied channel conditions. We therefore build several other networks that are trained on datasets generated from different realizations of γ.
Next, we test the robustness of each network when it is trained on a range γ = X ± 0.1 and tested at γ = X. Fig. 11(a) compares the baseline results for a network trained and tested at the same γ-values (red) to the results when we train on γ = X ± 0.1 and test at γ = X; Fig. 11(b) repeats the same experiment but with models trained over γ = X ± 0.2 instead, with τ = 0.8, σ² = 0.1, and SNR = 8 dB. These results show that we can afford to train over a range without much BER loss: at lower γ-values, the smaller training range of X ± 0.1 is preferred, while at higher γ-values we achieve good results when training over the slightly bigger range of X ± 0.2, meaning even fewer trained models are needed at higher γ-values.
We can also test the baseline performance (which trains at every γ-value tested on) against 3 networks trained over consecutive sub-ranges of γ, spanning net_1 ∈ [0.7, 0.8, . . .] through net_3 ∈ [1.2, 1.3, . . . , 1.6]; both schemes are tested over a sweep of the γ range. Fig. 12 shows these results against the baseline established earlier, with τ = 0.8, σ² = 0.1, and SNR = 8 dB. They show that we can establish an adaptive system with far fewer trained models, each model valid for a certain range of decaying factors γ. We can also see again that when operating at lower γ-values it is better to stick with a smaller range, as shown in Fig. 12(a), while at higher γ-values one model works well over a range of values γ ∈ [1.2, 1.3, . . . , 1.5], as shown in Fig. 12(b), instead of training one model per realization.
The approach of training over a range can really shine in a system limited to a certain number of models. For instance, if we are limited to just 5 models, we investigate the best way of training them. Method 1 trains at 5 discrete γ-values, and Method 2 trains over 5 different ranges of γ. Fig. 13 shows the result of both methods, with each network's result shown within its region of operation, for τ = 0.8, σ² = 0.1, and SNR = 8 dB. For Method 1, the 5 models are trained at γ ∈ [0.4, 0.7, 1, 1.3, 1.6], and for Method 2, the 5 models are trained over the following uniformly distributed ranges: net_1 ∈ [0.2, 0.3, . . . , 0.55], net_2 ∈ [0.55, 0.65, . . . , 0.85], net_3 ∈ [0.85, 0.95, . . . , 1.15], net_4 ∈ [1.15, 1.25, . . . , 1.45], and a fifth range covering the remaining higher γ-values. Method 3 instead trains the 5 models over non-uniform ranges that are narrower at the lower γ-values, where ISI is more severe. Fig. 14(a) shows the results of Method 3, and Fig. 14(b) overlays all methods on the same plot as the baseline. While it might not be immediately clear from Fig. 14(b) which method works best, we can compare the mean BER of each method over the tested γ range: μ_1 = 1.5 × 10⁻⁴ for Method 1, μ_2 = 1.08 × 10⁻⁴ for Method 2, and μ_3 = 8.9 × 10⁻⁵ for Method 3. Hence, the best results in the model-limited system are obtained with Method 3, i.e., training over γ ranges that are narrower at the lower γ-values.
Overall, we can adopt an adaptive system, which enables us to train fewer models with a slight loss in BER performance that is most noticeable when operating at lower decaying factor γ-values, where ISI is more severe.
These results suggest that training time and memory requirements can be significantly reduced: instead of training at each SNR, τ, and γ realization, fewer networks, trained with no noise and at fewer τ and γ values, are sufficient and more practical.

F. BENEFITS OF CUSTOM ACTIVATION LAYER
In the MetNet, a custom activation layer that performs f(x) = −|x|² is applied to the output of the fully connected layer. Although theoretically the inclusion of this layer may be unnecessary, since the network should be able to learn without being guided by model-based knowledge, in practice it was found to improve the BER performance. With that established, we now present the results of our tuned system with the activation layer, trained with no noise, and observe the improvements over the results achieved earlier in Fig. 3.

G. PERFORMANCE OF TUNED SYSTEM
In this section, we present the results of the system after the hyperparameter tuning shown earlier, with a training dataset simulated without AWGN, and with the inclusion of the activation layer. Fig. 16 shows the improved results of our MetNet: the result achieved with CSI uncertainty is near-optimal, as both our NN and the optimal ED with perfect CSI have a BER of 5.5 × 10⁻⁵ at 8 dB.
In the next section we extend these results by averaging them over a range of different channel models defined by different decaying factors, to show that the performance is maintained over different channels.

H. EFFECTS OF DIFFERENT CHANNEL MODELS
We explore the performance in 20 different channel models corresponding to 20 values of γ and present the average BER, as was done in [14]. We average our results over a sweep of decaying factors in the range γ ∈ {n/10 | n ∈ {1, 2, . . . , 20}}. Fig. 17 shows the averaged results: the optimal BER using a VA with perfect CSI knowledge is 7 × 10⁻⁵, while the MetNet with channel corruption has a BER of 8 × 10⁻⁴ and the ViterbiNet has a BER of 2 × 10⁻⁴. Finally, the traditional metric computation method using the ED has a much higher BER of 2 × 10⁻², seemingly approaching an error floor.

I. SPECTRAL EFFICIENCY GAINS
Another interesting result is the possible bandwidth efficiency gain specific to the proposed neural network-aided detection in the ISI FTN system. Fig. 18 shows the BER curves when using the MetNet architecture with CSI uncertainty at different values of τ. A slight BER performance degradation appears at τ = 0.75, meaning that this is roughly where the Mazo limit of the system lies. Hence, the bandwidth efficiency improvement for the system using the MetNet under channel uncertainty is around 33%. By comparison, when the CSI is perfectly known and the ED metric is used, it is possible to get a bandwidth efficiency gain of 39%, which is only slightly higher than for the imperfect-CSI case.
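For clarity, these gain figures follow from the symbol-rate speed-up of FTN: symbols arrive every τT seconds instead of every T seconds, so, assuming the gain is measured as the relative increase in symbols per unit bandwidth, a worked instance for the quoted value is:

```latex
\text{gain} = \frac{1}{\tau} - 1
\qquad\Longrightarrow\qquad
\frac{1}{0.75} - 1 = \frac{1}{3} \approx 33\%.
```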

J. HIGHER ORDER MODULATION
The previous simulations were done using a BPSK constellation; the network achieves results identical to Fig. 16 when tested with 4-QAM.
For 16-QAM, the results in Fig. 19 show that the MetNet is still able to recover the transmitted data, albeit with a slight performance penalty. This penalty is perhaps due to the fact that, after ISI, the distribution of the received samples is too dense, so the MetNet cannot always distinguish between different transmitted symbols.

VI. CONCLUSION
In this paper we explored the performance of an FTN system in an ISI channel, using Viterbi detection with different methods of metric computation. We proposed a novel low-complexity neural network architecture for calculating the metrics, the MetNet, and compared it to another neural network-based metric calculator, the ViterbiNet, as well as to the traditional method of calculating metrics using the Euclidean distance. We presented the results of all methods when CSI uncertainty is present in the system.
The results show that under uncertain channel conditions, both neural network approaches perform significantly better than the Euclidean distance-based approach. Additionally, we showed through a complexity analysis that the MetNet has a very low complexity approaching the simplicity of the Euclidean distance, along with better performance than the ViterbiNet.
We established that our MetNet performs better when trained without noise, and we tuned the network to set the number of channel instances used in training as well as the suitable number of epochs. We further explored the resiliency of the MetNet by training it over a range of decaying factors, γ, instead of training a different model at each instance, as well as its resiliency to being trained at mismatched accelerating parameter τ values. These results show that we can significantly reduce the number of trained models, and as such we established an adaptive system that can adopt suitable models based on the channel conditions. The results further show that the MetNet remains the favorable alternative under much more intense CSI uncertainty, where it exhibited strong robustness and only a slow degradation of BER performance as the error variance increased.
We showed that these results generalize to different channel models with different decaying factors, as the trends held true when averaged over a sweep of γ-values. Moreover, we presented results reflecting a BER performance enhancement when using our custom activation layer, which makes the output of the network closely represent a Gaussian PDF. Finally, we presented the bandwidth efficiency gains due to FTN when using the MetNet with CSI uncertainty over different decaying factors; the results show a bandwidth efficiency improvement of around 33%.
There are several directions for potential future work regarding this study. To begin with, this neural network is built on top of the VA, meaning that it still has the same M^L complexity in the modulation order and the channel length, which could prove limiting for higher-order modulation systems. Exploring the potential of this neural network on top of a reduced-complexity VA, whether through channel shortening [18], reducing the trellis size [17], or other schemes that reduce the M^L complexity, is an interesting direction. This would enable higher-order modulation, where the in-phase and quadrature components of the signal could each be handled by an identical NN structure; the low-complexity nature of the MetNet would suit this approach well.
Moreover, exploring the application of this NN in a multicarrier FTN system [1], as well as with channel coding [9], would allow us to reach much higher spectral efficiency gains. Additionally, due to its low-complexity nature, the MetNet is not suitable for severe dynamic fading, such as Rayleigh fading. Exploring model enhancements that can handle Rayleigh fading is an interesting direction and is left for future work.
Finally, due to its similarity to the ViterbiNet, the MetNet has the potential to be utilized in an online-training fashion, similar to the Meta-ViterbiNet [20], where the model updates in real time based on incoming signals, resulting in a more resilient and dynamic model.