Orthogonal Features Based EEG Signals Denoising Using Fractional and Compressed One-Dimensional CNN Autoencoder

This paper presents a fractional one-dimensional convolutional neural network (CNN) autoencoder for denoising the Electroencephalogram (EEG) signals which often get contaminated with noise during the recording process, mostly due to muscle artifacts (MA), introduced by the movement of muscles. The existing EEG denoising methods make use of decomposition, thresholding and filtering techniques. In the proposed approach, EEG signals are first transformed to orthogonal domain using Tchebichef moments before feeding to the proposed architecture. A new hyper-parameter ( $\alpha $ ) is introduced which refers to the fractional order with respect to which gradients are calculated during back-propagation. It is observed that by tuning $\alpha $ , the quality of the restored signal improves significantly. Motivated by the high usage of portable low energy devices which make use of compressed deep learning architectures, the trainable parameters of the proposed architecture are compressed using randomized singular value decomposition (RSVD) algorithm. The experiments are performed on the standard EEG datasets, namely, Mendeley and Bonn. The study shows that the proposed fractional and compressed architecture performs better than existing state-of-the-art signal denoising methods.


I. INTRODUCTION
ELECTROENCEPHALOGRAM (EEG) is the recording of electrical activity inside the human brain, obtained using electrodes attached to the scalp [1], [2]. During the recording process, EEG signals often get contaminated with various types of artifacts due to muscle activity, eye movements and heart rhythms, which are measured by electromyogram (EMG), electrooculogram (EOG) and electrocardiogram (ECG) signals, respectively [3]. Among these, the electromyogram/muscle artifact (EMG/MA) is generally found to be challenging to eliminate, mainly due to its high amplitude and its broad frequency and anatomical distributions [4].
Different approaches have been reported in the existing literature to remove muscle artifacts (MA) from contaminated EEG signals. Adaptive filters [5], low-pass filters [6] and filter banks [7] have been employed for signal denoising. Adaptive algorithms such as recursive least squares (RLS) [8], least mean squares (LMS) [9] and the Kalman filter [10] have also been proposed. Numerous decomposition techniques, such as the wavelet transform [11], empirical mode decomposition (EMD) [12] and ensemble empirical mode decomposition (EEMD) [13], have also been employed to achieve good results.
The application of machine learning and deep learning architectures has also been found to be very effective in denoising signals [17]. The denoising auto-encoder is one such deep learning architecture that has outperformed existing non-deep-learning denoising methods [18], [19], [20]. However, these deep learning approaches do not address the effect of compression on signal denoising as the number of trainable weights used in the architecture increases. This can cause redundancy in the weights and memory issues when the model is deployed on low energy devices. In recent years, researchers have proposed several techniques to combat redundancy in neural network weights. Thresholding techniques like pruning are proposed in [21] to remove the least important trainable weights from the neural network. Another compression method calculates a low-rank approximation of the weight matrices, which simultaneously reduces storage and time complexity during the training and testing phases [22], [23]. The concept of randomized singular value decomposition (RSVD) was introduced in [24] as a faster way of calculating low-rank approximations than the singular value decomposition (SVD). This idea was also explored in [25] for working with large-scale data and was found to be quite effective. In this paper, the RSVD algorithm is used for compressing the trainable weight matrices of the proposed architecture.
Deep learning methods have also been employed with input features transformed to the frequency domain to achieve high speed deep learning architectures [26]. The discrete cosine transform (DCT) coefficients of input images have been used to represent their important features, so that neural networks can learn the image manifold in a better way and yield superior image denoising results [27]. Recently, the orthogonal moment domain has also been explored to address common problems in image processing [28], [29]. One such kind of orthogonal moments, namely Tchebichef moments (TM), exhibits an essential property of energy compaction, which has led to promising results in denoising images [30]. This finding has motivated us to exploit the advantage of feeding these TM based orthogonal features to the proposed one-dimensional convolutional neural network (CNN) architecture.
Traditional CNN architectures use integer order calculus for calculating gradients during back-propagation. Fractional calculus is now becoming popular for solving significant problems in the image processing domain, such as image denoising [31], [32] and texture enhancement [33]. It has also found a place in neural networks, where gradients are calculated using differentiation with respect to a particular fractional order (α), and has given a performance boost in classification problems compared to conventional neural networks [34]. We have also used fractional calculus in the back-propagation phase of the proposed architecture for improved performance in EEG signal denoising.
This paper is organized as follows. Section II provides some mathematical preliminaries that are useful in understanding the underlying mechanics of our architecture. Sections III and IV present the workflow of our fractional one-dimensional CNN and its compressed form, respectively. Experimental results are provided in Section V, which contains the details of the datasets used, data preparation for training, performance metrics and the evaluation of our proposed model under compression, followed by a discussion of the results obtained. Section VI concludes this work.

A. Tchebichef Moments (TM)
Let $x(n)$ be an EEG signal with $n = 1, 2, \ldots, N$. The relationship between the noisy signal $y(n)$ and the original signal $x(n)$ corrupted by noise is given as follows:

$$y(n) = x(n) + \zeta(n)$$

where $\zeta(n)$ is the muscle artifact (MA) noise. This paper proposes a deep learning architecture which recovers an estimate of the original signal from its noisy observation $y(n)$.
The Tchebichef moment of order $p$ for a signal $x(n)$ of length $N$ samples is given by [35]:

$$T_p = \sum_{n=0}^{N-1} t_p(n; N)\, x(n)$$

with $p = 0, 1, 2, \ldots, N-1$. For simplicity, $t_p(x)$ is used to represent $t_p(x; N)$, the orthonormal Tchebichef polynomials, which are generated by a recurrence relation in $p$ starting from the initial conditions $t_0(x)$ and $t_1(x)$. The set of TMs up to order $p$ in matrix form is given as

$$T_N(X) = Q\,X$$

where $X = [x(0), x(1), x(2), \ldots, x(N-1)]$, treated as a column vector, and $Q$ is the Tchebichef polynomial matrix up to order $p$, whose rows are the polynomials $t_0, t_1, \ldots, t_p$ evaluated at the sample positions. The original one-dimensional signal $X$ can be reconstructed from the set of Tchebichef moments, using the orthonormality of $Q$, as

$$X = Q^{\top}\, T_N(X)$$
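As an illustration of the transform described above, the following is a minimal NumPy sketch that builds the orthonormal Tchebichef polynomial matrix and applies the forward and inverse transforms. It assumes the three-term recurrence in the order $p$ that is commonly given for orthonormal Tchebichef polynomials; the coefficient expressions and the function name `tchebichef_matrix` are assumptions of this sketch, not taken from the text. The recurrence is known to lose accuracy for long signals, so a more stable scheme may be needed for fragments of 250 samples.

```python
import numpy as np

def tchebichef_matrix(N, order=None):
    """Return Q whose row p holds the orthonormal polynomial t_p(n), n = 0..N-1.

    Assumed recurrence (standard form, not taken from the paper):
        t_p(n) = (a1 * n + a2) * t_{p-1}(n) + a3 * t_{p-2}(n)
    """
    P = N if order is None else order
    n = np.arange(N, dtype=float)
    Q = np.zeros((P, N))
    Q[0] = 1.0 / np.sqrt(N)                                   # t_0
    if P > 1:
        Q[1] = (2.0 * n + 1.0 - N) * np.sqrt(3.0 / (N * (N**2 - 1.0)))  # t_1
    for p in range(2, P):
        c = np.sqrt((4.0 * p**2 - 1.0) / (N**2 - p**2))
        a1 = (2.0 / p) * c
        a2 = ((1.0 - N) / p) * c
        a3 = ((1.0 - p) / p) * np.sqrt((2.0 * p + 1.0) / (2.0 * p - 3.0)) \
            * np.sqrt((N**2 - (p - 1.0)**2) / (N**2 - p**2))
        Q[p] = (a1 * n + a2) * Q[p - 1] + a3 * Q[p - 2]
    return Q

# Forward transform T = Q x and reconstruction x = Q^T T on a short signal.
Q = tchebichef_matrix(8)
x = np.sin(np.linspace(0.0, 3.0, 8))
T = Q @ x
x_rec = Q.T @ T
```

With the full set of orders ($p = 0, \ldots, N-1$), `Q` is square and orthonormal, so the reconstruction is exact up to rounding.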

B. Compression using RSVD
The kernel and weight matrices used in the proposed architecture are compressed using low-rank approximations of these matrices. This is done using the RSVD technique, which projects the original matrix $A \in \mathbb{R}^{n \times m}$ onto a smaller randomized subspace, giving $B \in \mathbb{R}^{c \times m}$, where $c < n$. A kernel matrix $K \in \mathbb{R}^{n \times c \times f}$, where $n$, $c$ and $f$ refer to the number of filters, the number of channels and the feature dimension, respectively, is first reshaped into a 2D matrix $A \in \mathbb{R}^{n \times m}$, where $m = c \cdot f$. To calculate the low-rank approximation, let $O \in \mathbb{R}^{m \times (r+p)}$ be a normally distributed random matrix, where $r$ is the rank to be approximated and $p$ denotes the number of additional projections, such that $r + p < n$. We define $Q_i$ as the orthogonal basis after $i$ iterations, where $i = 1, 2, 3, \ldots, k$ and $k$ denotes the number of subspace iterations. Starting from an initial value $Q_0$, each $Q_i$ is obtained from $Q_{i-1}$ through a recurrence relation that uses qr(), the function for the QR decomposition operation, which factorizes a matrix into an orthogonal matrix and an upper triangular matrix; only the orthogonal factor is kept. The SVD of the resulting condensed matrix (Eq. (15)) yields $U_r \in \mathbb{R}^{n \times r}$ and $V_r \in \mathbb{R}^{m \times r}$, which are matrices with orthonormal columns, and $S_r \in \mathbb{R}^{r \times r}$, which is a diagonal matrix.
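The procedure described in this subsection can be sketched as follows. This is a minimal NumPy version of the standard randomized SVD with subspace iterations, under the assumption that the basis is refined by alternating products with $A$ and $A^{\top}$, each followed by a QR step. The exact recurrences used by the authors are not reproduced here, and the function name `rsvd` and its default arguments are hypothetical.

```python
import numpy as np

def rsvd(A, r, p=5, k=2, seed=0):
    """Rank-r randomized SVD of A (n x m) with p extra projections, k iterations."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    O = rng.standard_normal((m, r + p))        # normally distributed test matrix
    Q, _ = np.linalg.qr(A @ O)                 # initial orthogonal basis, n x (r+p)
    for _ in range(k):                         # subspace iterations
        Z, _ = np.linalg.qr(A.T @ Q)           # keep only the orthogonal factor
        Q, _ = np.linalg.qr(A @ Z)
    B = Q.T @ A                                # condensed matrix, (r+p) x m
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_b                                # map back to the original space
    return U[:, :r], s[:r], Vt[:r].T           # U_r (n x r), S_r diagonal, V_r (m x r)

# Example: a kernel with 64 filters, 64 channels and width 3, reshaped to 64 x 192.
rng = np.random.default_rng(1)
K = (rng.standard_normal((64, 3)) @ rng.standard_normal((3, 192))).reshape(64, 64, 3)
A = K.reshape(64, 64 * 3)
U_r, s_r, V_r = rsvd(A, r=3)
A_approx = U_r @ np.diag(s_r) @ V_r.T
```

Storing `U_r`, `s_r` and `V_r` instead of `A` requires $r(n + m + 1)$ values rather than $nm$, which is where the memory saving comes from when $r$ is small.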

C. Fractional Order Processing
Unlike integer order derivatives, fractional order derivatives have several definitions. The three most commonly used are the Grünwald-Letnikov (G-L), Riemann-Liouville (R-L) and Caputo derivatives [34]. We have used the Caputo fractional derivative (CFD) of a function $f(x)$ with order $\alpha$, defined as follows:

$${}^{C}_{a}D^{\alpha}_{x} f(x) = \frac{1}{\Gamma(n-\alpha)} \int_{a}^{x} (x-\tau)^{n-\alpha-1} f^{(n)}(\tau)\, d\tau$$

where $n - 1 < \alpha < n$, $n \in \mathbb{N}^{+}$, $a$ is the initial value and $\Gamma(\cdot)$ denotes the Gamma function. The CFD is found to be consistent with the integer order derivatives used in neural networks, because of which this derivative is applied in several engineering problems [34]. This motivated us to employ it during the back-propagation of our proposed model. Let $\alpha$ be the fractional order for which the derivative needs to be calculated and $f(x) = (x-a)^k$ be a polynomial function of degree $k$. The Caputo fractional derivative is given by [36]:

$${}^{C}_{a}D^{\alpha}_{x} (x-a)^k = \frac{\Gamma(k+1)}{\Gamma(k+1-\alpha)} (x-a)^{k-\alpha}$$

For simplicity of notation, the fractional derivative is written as $D^{\alpha}_{x} f(x)$ and will be used in calculating the gradients of the proposed architecture.
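As a quick numerical check of the power-function rule used in this subsection, the sketch below evaluates $\Gamma(k+1)/\Gamma(k+1-\alpha)\,(x-a)^{k-\alpha}$ with Python's standard `math.gamma`. The function name `caputo_power` is hypothetical, and the sketch assumes $x > a$ and that $k + 1 - \alpha$ is not a non-positive integer (where the Gamma function has poles).

```python
from math import gamma

def caputo_power(k, alpha, x, a=0.0):
    """Order-alpha derivative of (x - a)**k under the power rule, assuming x > a."""
    return gamma(k + 1) / gamma(k + 1 - alpha) * (x - a) ** (k - alpha)

# alpha = 1 recovers the ordinary derivative: d/dx x^2 = 2x, which is 6 at x = 3.
integer_case = caputo_power(2, 1.0, 3.0)
# Half-order derivative of x at x = 1 equals 1 / Gamma(1.5) = 2 / sqrt(pi).
half_case = caputo_power(1, 0.5, 1.0)
```

The first case shows the consistency with integer order derivatives mentioned above: setting $\alpha = 1$ reduces the rule to the familiar $k(x-a)^{k-1}$.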

III. PROPOSED ARCHITECTURE
The proposed fractional CNN based architecture for denoising EEG signals is shown in Fig. 1. The model is based on the encoder-decoder architecture. The encoder represents the information in the EEG signal in compressed form as latent vectors. This is followed by the up-sampling (decoder) operation, which recovers the information present in the EEG signal from the latent space. As can be observed from Fig. 1, each of the first two convolutional layers, followed by an average pooling layer, constitutes an encoder block, while each of the last two convolutional layers, followed by an up-sampling layer, constitutes a decoder block. Here, the up-sampling layers are used for recovering structural details present in the EEG signals.
The original EEG signal fragments are transformed into the TM (orthogonal) space $T_N(X)$ using Eq. 8, where $X$ is the original signal fragment of dimension $N$. The noisy signal is transformed in the same way to give $T_N(y)$, which is then fed as an input to the first convolutional layer of the proposed architecture. The architecture has four convolutional layers (CONV), two average pooling layers, two up-sampling layers and a flattened layer, which is connected to the fully connected (FC) layer. The first convolutional layer (CONV1) has 16 filters. The next two convolutional layers (CONV2, CONV3) have 64 filters each, while CONV4 has 16 filters. Rectified linear units (ReLU) are employed as activation functions for the convolutional hidden layers. There are 250 neurons in the fully connected layer (FC). All the convolutional layers have a kernel of dimension $1 \times 3$ and a kernel stride of 1. The average pooling layers have a kernel dimension of 2 and a stride of 2, while the up-sampling layers have an up-sampling factor of 2. The padding ensures that the output dimension is the same as that of the input. Next, the workflow of the architecture, which involves forward propagation and the proposed fractional backward propagation, is discussed.

Fig. 2: Illustration of im2col transformation on 1D input

A. Forward Propagation
1) Convolutional Layer: For the $i$-th convolutional layer, the input $I^{[i-1]}$ is characterized by its number of channels and its feature dimension. The trainable kernel is $K^{[i]}$, where $F^{[i]}$ is the kernel filter dimension and $N_F^{[i]}$ denotes the number of kernel filters, and the matrix $b^{[i]}$ denotes the bias for this layer. The input-output relationship of the layer is given in Eq. (19), which is evaluated for each filter $m = 0, \ldots, N_F^{[i]} - 1$. $S^{[i]}$ is the output of the convolution layer, with $N_F^{[i]}$ feature maps of output feature dimension $W$, where $W$ is determined by the input feature dimension, the kernel dimension $F^{[i]}$ and the padding size $g^{[i]}$. In this paper, the Tchebichef vector of the noisy signal, $T_N(y)$, is taken as the input features $I^{[0]}$ in Eq. (19) and is fed to the first convolutional layer of the architecture.
For faster implementation, matrix multiplication is used to represent the convolution operation given in Eq. (19). For this, the input $I^{[i-1]}$ is converted into matrix form using the im2col transformation. Here, im2col refers to the technique in which each input patch of size $1 \times F^{[i]}$ is taken and stacked as a column of a matrix. A pictorial illustration of im2col is shown in Fig. 2. With this transformation, Eq. (19) takes the matrix form of Eq. (22), in which ($*$) denotes matrix multiplication and $b^{[i]}$ is the bias matrix of size $(N_F^{[i]} \times 1)$. Each convolutional layer output $S^{[i]}$ is fed as an input to the rectified linear unit (ReLU) activation function, which gives $C^{[i]}$ as the output (Eq. (23)):

$$C^{[i]} = \max\left(0,\, S^{[i]}\right)$$

2) Average Pooling Layer: The average pooling operation is applied on the activated feature maps $C^{[i]}$ of the $i$-th convolutional layer. The pooling output $P^{[i]}$ is computed for every feature map $m = 0, \ldots, N_F^{[i]} - 1$ by averaging over the pooling filter size.
3) Up-sampling Layer: The up-sampling operation increases the resolution of the activated feature maps $C^{[i]}$ of the $i$-th convolutional layer. The output of the up-sampling layer, $U^{[i]}$, is computed for every feature map $m = 0, \ldots, N_F^{[i]} - 1$ according to the up-sampling factor.
4) Flattened Layer: Before the features are fed to the FC layer, they need to be flattened into a one-dimensional format (Eq. (26)). Here, $R$ is the input for this layer and $F$ is the flattened output, which is used in the FC layer.
5) FC Layer: The forward propagation through the FC layer is given by Eq. (27) in terms of $F$, $W$ and $B$, which denote the output of the flattened layer, the weight matrix and the bias, respectively. The output $\hat{T}_N(x)$ is the estimated denoised signal, which is used in the formulation of the loss function discussed next.
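To make the forward operations of this subsection concrete, the following NumPy sketch implements a stride-1 one-dimensional convolution through im2col and a matrix product, followed by ReLU, average pooling and up-sampling. It is a sketch under stated assumptions: zero padding, non-overlapping pooling windows and nearest-neighbour (repeat) up-sampling are assumed, and the function names are hypothetical rather than the authors' implementation.

```python
import numpy as np

def im2col_1d(x, F, pad):
    """x: (channels, N) -> (channels * F, W), with W = N + 2 * pad - F + 1."""
    C, N = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad)))          # zero padding on both ends
    W = N + 2 * pad - F + 1
    cols = np.empty((C * F, W))
    for k in range(W):
        cols[:, k] = xp[:, k:k + F].reshape(-1)   # one input patch per column
    return cols

def conv1d_relu(x, K, b, pad):
    """K: (n_filters, channels, F), b: (n_filters, 1). Returns the ReLU output."""
    n_f, C, F = K.shape
    S = K.reshape(n_f, C * F) @ im2col_1d(x, F, pad) + b   # convolution as matmul
    return np.maximum(S, 0.0)

def avg_pool_1d(c, f=2):
    """Average over non-overlapping windows of size f (stride f)."""
    n_f, W = c.shape
    W_out = W // f
    return c[:, :W_out * f].reshape(n_f, W_out, f).mean(axis=2)

def upsample_1d(c, u=2):
    """Repeat each element u times along the feature dimension."""
    return np.repeat(c, u, axis=1)

x = np.arange(6, dtype=float).reshape(1, 6)       # one channel, six samples
K = np.ones((1, 1, 3))                            # one filter of width 3
b = np.zeros((1, 1))
c = conv1d_relu(x, K, b, pad=1)                   # same length as the input
p = avg_pool_1d(c)
u = upsample_1d(p)
```

With a kernel width of 3 and a padding of 1, the output keeps the input length, which matches the statement that padding preserves the feature dimension.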

B. Loss function
The proposed loss function for the fractional CNN autoencoder is given in Eq. (28), where $M$ represents the number of training samples, $E$ is the number of convolutional layers and $\lambda$ represents the regularization parameter. For back-propagation, the derivative of the loss with respect to $\hat{T}_N(x)$ is calculated in Eq. (29), which is used during the back-propagation process discussed next.

C. Fractional Back-Propagation
The proposed back-propagation technique has the advantage of an extra hyper-parameter, the fractional order $\alpha$, which can be tuned to obtain the best denoising performance and also plays an important role in training the architecture.
1) FC Layer: The value obtained using Eq. (29) is used as the input to the FC layer. Using Eqs. (27) and (28), the fractional gradient of the loss with respect to the weights, $D^{\alpha}_{W} L$, is expressed in Eq. (30). The fractional gradients on the right-hand side of Eq. (30) are calculated using Eq. (18), which involves the gamma function $\Gamma(\cdot)$, and are substituted back into Eq. (30). The fractional gradient with respect to the bias $B$ is obtained similarly.
2) Flattened Layer: This layer flattens the input during forward propagation, as mentioned in Eq. (26). During backward propagation, the output gradient $\partial L / \partial F$ is therefore reshaped to the shape of $R^{[i]}$ to get the input gradient $\partial L / \partial R^{[i]}$.
3) Up-sampling Layer: For back-propagating the gradients, the successive gradients in $\partial L / \partial U^{[i]}$ are added and assigned to each element in $\partial L / \partial C^{[i]}$, with the number of gradients added set by the up-sampling factor. The gradient $\partial L / \partial C^{[i]}$ for the $m$-th feature map can also be represented in matrix form, in terms of the $k$-th elements of the $m$-th feature map in $\partial L / \partial U^{[i]}$.
4) Average Pooling Layer: Similar to the back-propagation for the up-sampling layer, $\partial L / \partial C^{[i]}$ is calculated from the pooling gradient $\partial L / \partial P^{[i]}$. Because $P^{[i]}$ is the average pooling output, the operation is slightly different: each element in $\partial L / \partial P^{[i]}$ is divided by the pooling size $f$ and the error gradients are back-propagated proportionally to the input (Eq. (37)). The gradient for the $m$-th feature map calculated in Eq. (37) can also be represented in matrix form, in terms of the $k$-th elements of the $m$-th feature map in $\partial L / \partial P^{[i]}$.
5) Convolutional Layer: In this layer, backward propagation of the errors is carried out to calculate the fractional gradients for the kernel and the bias. Using Eqs. (22) and (28), the fractional kernel gradient is first written as Eq. (39). Here, the gradient $\partial L / \partial S^{[i]}$ is calculated by element-wise multiplication of $\partial L / \partial C^{[i]}$ and $\partial C^{[i]} / \partial S^{[i]}$ (Eq. (41)), where $\partial C^{[i]} / \partial S^{[i]}$ follows from the ReLU of Eq. (23); substituting it into Eq. (41) results in Eq. (43). Using Eqs. (22) and (18), the individual terms of Eq. (39) are then evaluated, and substituting them into Eq. (39) gives the fractional kernel gradient. Similarly, using Eqs. (41) and (43), the fractional gradient with respect to the bias is obtained.
Next, for back-propagating the errors to the previous layer, such as an average pooling or up-sampling layer, the input gradient $\partial L / \partial I^{[i-1]}$ needs to be calculated. For this, the gradient with respect to the im2col matrix of the input is calculated first; its value can be obtained using Eq. (22). The inverse transformation col2im (Fig. 3) is then operated on this gradient, and the result is passed to the previous layer, which is either an average pooling layer or an up-sampling layer. Once the gradients of all the trainable parameters are obtained, the parameter update is carried out using gradient descent with learning rate $\eta$ (Eqs. (50)-(53)).
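The role of the fractional order in the parameter update can be illustrated with a small sketch. It assumes that the order-$\alpha$ gradient is formed from the ordinary gradient by applying the power rule for $k = 1$, which gives the factor $|\theta_k - \theta_{k-1}|^{1-\alpha} / \Gamma(2-\alpha)$ that is consistent with the $\Gamma(2-\alpha)$ term in the convergence analysis. This form, the function name `fractional_step` and the small constant `eps` that avoids a zero base are assumptions of the sketch, not the authors' exact update equations.

```python
import numpy as np
from math import gamma

def fractional_step(theta, theta_prev, grad, lr, alpha, eps=1e-8):
    """One gradient descent step with an assumed order-alpha scaling of grad.

    theta, theta_prev: current and previous parameter values (arrays or floats)
    grad: ordinary (integer order) gradient of the loss at theta
    eps: hypothetical safeguard against a zero base when theta == theta_prev
    """
    scale = (np.abs(theta - theta_prev) + eps) ** (1.0 - alpha) / gamma(2.0 - alpha)
    return theta - lr * grad * scale

theta = np.array([1.0, -0.5])
theta_prev = np.array([0.0, 0.5])
grad = np.array([2.0, -1.0])

plain = fractional_step(theta, theta_prev, grad, lr=0.1, alpha=1.0)
frac = fractional_step(theta, theta_prev, grad, lr=0.1, alpha=1.2)
```

At $\alpha = 1$ the scaling factor is 1 and the step reduces to conventional gradient descent, so the conventional CNN is the special case $\alpha = 1$ of this update.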

IV. COMPRESSED ARCHITECTURE
The flowchart for the compressed version of the fractional CNN architecture is shown in Fig. 4. The training phase consists of the forward and backward propagation of the fractional architecture discussed in Sec. III. This is followed by updating the trainable weights using fractional gradient descent. Next, the trained weights are compressed using the RSVD function discussed in algorithm 1. The inputs required for the calculation of RSVD are $\theta$ and $r$. Here, $r$ denotes the rank of the matrix, whereas $\theta \in \{K^{[i]}, W\}$ denotes the trainable set consisting of the kernels $K^{[i]}$ used in the convolution layers (CONV) and the interconnection weight $W$ between the flattened and FC layers. The optimized rank (opt) is calculated as the rank at which 90% of the variance is covered by the singular vectors obtained using the SVD decomposition given in Eq. (15). This is carried out using check optimized rank. The compression procedure carried out during training is summarized in algorithm 1.
During the testing phase shown in Fig. 4, the architecture has the optimized trainable parameter $\theta_{opt}$, obtained using algorithm 1. Next, $\theta_{opt}$ is compressed based on the compression rate ($C_R$), resulting in $\theta_c$, which is a rank-$r$ approximation of $\theta_{opt}$. Finally, the compressed parameter $\theta_c$ is used for denoising the EEG signals. The process is summarized in algorithm 2. Here, the compression rate $C_R$ is varied from 5% to 95% for our observations, and this is done for various values of the fractional order $\alpha$ ranging from 1 to 1.5.
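The rank selection and truncation steps described above can be sketched as follows. The sketch makes three assumptions that are not stated in the text: the variance carried by a component is taken to be its squared singular value, the compression rate is mapped to a rank by keeping a fraction $(1 - C_R)$ of the optimized rank, and a full SVD is used instead of RSVD to keep the example short. All function names are hypothetical.

```python
import numpy as np

def optimized_rank(A, coverage=0.90):
    """Smallest rank whose components cover `coverage` of the total variance."""
    s = np.linalg.svd(A, compute_uv=False)        # singular values, descending
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, coverage) + 1)

def rank_for_rate(opt_rank, c_r):
    """Assumed mapping from a compression rate in [0, 1) to a retained rank."""
    return max(1, int(round(opt_rank * (1.0 - c_r))))

def truncate(A, r):
    """Rank-r approximation of A from its leading singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

A = np.diag([3.0, 1.0, 1.0, 1.0])
opt = optimized_rank(A)                 # 9/12, 10/12, 11/12 -> first >= 0.9 at rank 3
r = rank_for_rate(opt, 0.30)            # keep 70% of the optimized rank
A_c = truncate(A, r)
```

In this toy example the squared singular values are 9, 1, 1, 1, so three components are needed to reach 90% of the variance.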

V. EXPERIMENTS AND EVALUATION
In this section, several experiments are conducted to validate the efficiency of our proposed architecture. First, the standard datasets on which the evaluations are conducted are presented, followed by experimental results including a comparison with existing MA removal methods. Next, a detailed study is presented in which the performance of the architecture is examined after compressing the kernel weights using low-rank approximation. All the experiments are performed on a Tesla K80 GPU.

A. EEG Datasets
The architecture is evaluated on two publicly available databases, i.e., the Mendeley database and the epileptic Bonn database.

B. Data Preparation and Performance Metrics
For the Mendeley database, we took 1026 signals of 2000 samples each and split them into an 80% training set and a 20% testing set. The Bonn database has five subjects, each having 100 signals. We take those 100 signals and make an 80%-20% train-test split across all the subjects, i.e., 80 signals are taken from each of the five subjects for training, while the remaining 20 are used for testing. The contaminated signals are then generated by randomly mixing EMG signals with the original EEG signals. For comparison with existing MA removal methods, we take the subject 'Z' from the Bonn database (represented by Bonn(Z)), while the whole testing set is taken from the Mendeley database.
The next step involves creating fragments of 250 samples from both the noisy and original EEG signals. Cutting each signal into consecutive fragments yields only 8 times as many fragments as signals, which is not enough for training a deep neural network. To combat this, we perform data augmentation by randomly choosing a starting point in a particular signal, taking a fragment of 250 samples from it, and repeating this process for a number of iterations until we have around 20000 fragments for training. All these fragments are then transformed into the orthogonal domain using TMs. After normalizing the fragments using the StandardScaler class from the sklearn [37] library, the resulting noisy Tchebichef vectors are fed as an input to the proposed architecture.
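The augmentation step described above can be sketched as random cropping of aligned clean and noisy fragments. This is a minimal NumPy illustration, assuming that the signal and the starting point are both drawn uniformly at random; the function name `random_fragments` and the seed are hypothetical, and the Tchebichef transform and normalization steps are left out.

```python
import numpy as np

def random_fragments(clean, noisy, n_fragments, length=250, seed=0):
    """Draw aligned fragments of `length` samples from clean/noisy signal arrays.

    clean, noisy: arrays of shape (n_signals, n_samples) with matching rows
    """
    rng = np.random.default_rng(seed)
    n_signals, n_samples = clean.shape
    rows = rng.integers(0, n_signals, size=n_fragments)               # which signal
    starts = rng.integers(0, n_samples - length + 1, size=n_fragments)  # start point
    idx = starts[:, None] + np.arange(length)[None, :]                # sample indices
    return clean[rows[:, None], idx], noisy[rows[:, None], idx]

clean = np.arange(2 * 2000, dtype=float).reshape(2, 2000)   # toy stand-in signals
noisy = clean + 1.0
frag_clean, frag_noisy = random_fragments(clean, noisy, n_fragments=10)
```

Using the same row and start indices for both arrays keeps every noisy fragment paired with its clean target.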
The evaluation of the proposed architecture is performed using different performance metrics, such as the signal-to-noise ratio (SNR). In the metric definitions, $x(n)$ is the original signal, $\hat{x}(n)$ is the reconstructed signal, $M_x$ and $M_{\hat{x}}$ denote the means of $x(n)$ and $\hat{x}(n)$, respectively, and $N$ is the total number of samples.
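For reference, the sketch below computes three metrics that are commonly used for signal denoising and that match the symbols listed above: SNR in decibels, root-mean-square error, and the correlation coefficient, which is the one that uses the signal means. Apart from SNR, the choice of metrics and their exact definitions are assumptions of this sketch, and the function names are hypothetical.

```python
import numpy as np

def snr_db(x, x_hat):
    """Signal-to-noise ratio in dB: signal energy over reconstruction error energy."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return 10.0 * np.log10(np.sum(x**2) / np.sum((x - x_hat)**2))

def rmse(x, x_hat):
    """Root-mean-square error over the N samples."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.sqrt(np.mean((x - x_hat)**2))

def correlation(x, x_hat):
    """Correlation coefficient between the original and reconstructed signals."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    dx, dh = x - x.mean(), x_hat - x_hat.mean()
    return np.sum(dx * dh) / np.sqrt(np.sum(dx**2) * np.sum(dh**2))

x = np.array([1.0, 2.0, 3.0, 4.0])
x_hat = x + 0.1                          # constant offset as a toy reconstruction
values = snr_db(x, x_hat), rmse(x, x_hat), correlation(x, x_hat)
```

A higher SNR and correlation and a lower RMSE indicate a reconstruction closer to the original signal.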

C. Denoising Performance
To validate the denoising performance of the architecture, a comparative analysis is carried out with existing MA removal methods that have given good results on both databases. It can be observed from Table I that the proposed fractional CNN architecture outperforms all of the existing methods on the performance metrics. Our model gives an improvement of 8.5% and 12.7% in SNR values for Mendeley and Bonn(Z), respectively, when compared to the recently introduced variational mode decomposition (VMD) method.
The optimal hyper-parameters for the convolutional neural network were selected after parameter tuning. The training and testing losses were monitored after each epoch to check for over-fitting. A batch size of 64, a learning rate (η) of 0.0005, a regularization parameter (λ) of 0.00001 and training for 300 epochs were found to be optimal for the Mendeley database. For the Bonn database, training is performed for 200 epochs while the other hyper-parameters remain the same.
Apart from the standard hyper-parameters listed in the preceding paragraph, the proposed architecture provides an extra hyper-parameter α that can be tuned to boost the denoising performance. The fractional order (α) is used in calculating the weight gradients in back-propagation, as discussed in Sec. III-C. Experiments are conducted in which the value of α is varied from 1 to 1.6 in steps of 0.1, and the optimal one is taken for the final evaluation. The performance metrics are reported in Table II.

D. Compression Analysis
In this section, denoising results are presented when the fractional auto-encoder is trained with RSVD compression carried out on the trainable weights (see Sec. IV). The optimized rank for the weight matrix in the FC layer and the kernel matrices in the CONV2 and CONV3 layers is selected such that 90% of the singular values can be retained. Table III shows the original and the optimized ranks for the CONV2, CONV3 and FC layers calculated using algorithm 1. It can be seen that for the Bonn(Z) database, the optimized rank for the FC layer is higher than that for the Mendeley database, which signifies that this layer will be more sensitive to compression for the Bonn(Z) database.
The CONV2, CONV3 and FC layers are individually compressed using the optimized ranks given in Table III, and the performance of the model is evaluated using algorithm 2. Here, the CONV1 layer is not compressed as it interacts directly with the input. Fig. 7 shows the effect of layer-wise compression on denoising performance, evaluated in terms of SNR, using the proposed model at various fractional orders. It can be observed that the conventional CNN auto-encoder network with α = 1, shown in Fig. 7(a), outperforms the existing state-of-the-art methods if the CONV2 layer is compressed by up to 25%. However, as shown in Fig. 7(c), the proposed architecture at α = 1.2 gives a higher SNR than the conventional CNN even when the CONV2 layer is compressed by 30%. Likewise, for FC layer compression, the architecture at α = 1.2 gives better SNR values than other state-of-the-art methods even at 60% compression, while the conventional CNN outperforms other methods only up to 35% compression. Moreover, the proposed architecture at α = 1.5 (Fig. 7(f)) gives good performance under compression of up to 55% and 35% of the FC and CONV2 layers, respectively. A similar analysis is carried out for the Bonn(Z) database and the results are shown in Fig. 8. It can be observed that the best SNR compared to the existing methods is obtained at α = 1.2. This performance is maintained even when the CONV2, CONV3 and FC layers are compressed by 25%. For this database, it can be seen that the FC layer is more sensitive to compression. This is because the optimized rank (see Table III) is 15% of the original rank; as a result, compression at early stages leads to a drop in the SNR performance. This is not the case for the Mendeley database, as its optimized rank is 25% of the original rank. For further qualitative analysis, Fig. 9 shows the denoised signals produced by the architecture at α = 1.2 when the CONV2 layer is subjected to various compression rates. It can be seen that even at 50% compression the reconstructed signal resembles the original one at most of the points.
Two things can be concluded from this study. First, tuning the hyper-parameter α can boost the SNR values of the EEG signals significantly. Second, compressing the weights of the architecture does not degrade the SNR performance too much, and it is still better than the existing methods. Compressing the architecture has the advantage that it consumes less memory space and is suitable for edge computing devices.

VI. CONCLUSION
A fractional and compressed one-dimensional CNN autoencoder, which uses orthogonal features in the form of Tchebichef moments, has been proposed. The proposed method gives superior results in denoising MA contaminated signals when compared to existing MA removal methods, with the best result observed at α = 1.2. Moreover, even as the compression ratio ($C_R$) for the weights of the architecture is increased, it beats the existing methods when evaluated using the performance metrics. Another important observation is that a compressed fractional architecture at α = 1.2 performs better than the conventional CNN auto-encoder without compression, i.e., with 60% compression of the FC layer for the Mendeley database and 30% compression of the CONV2 layer for the Bonn(Z) database. Compressing the architecture not only reduces its memory footprint but also delivers superior performance compared to other methods. Qualitative analysis of the signals is presented at various stages of compression, which shows that the denoised signals are close to the original signals. Future work involves deployment of the compressed architecture on portable low energy devices.

A. Convergence Analysis
A general fractional gradient descent update can be written as in Eq. (59), where $k$ denotes the iteration. Using Eq. (18), this update can be re-written in explicit form.
In the proof of Theorem 1, let $d = \eta \frac{1}{\Gamma(2-\alpha)} \inf_{k>E} \frac{\partial L}{\partial Z_k}$ and assume there is an $\varepsilon$ such that $2\varepsilon < d^{1/\alpha}$. Combining Eq. (65) with Eq. (67), which follows from Eq. (66), gives a relation which implies that $Z_k$ is not convergent. This contradicts the assumption that $Z_k$ is convergent to $Z$, and thus the proof is completed.
Theorem 2: The FC layer updated by the fractional gradient descent performed using Eqs. (51) and (53) is convergent to a real extreme point.
From Eq. (51), the update equation can be written for each element of the weight matrix $W$. The remaining part is similar to the proof of Theorem 1. Let $d = \eta \frac{1}{\Gamma(2-\alpha)} \inf_{k>E} \frac{\partial L}{\partial W_{ij(k)}}$, where the infimum is positive by Eq. (72), and assume there is an $\varepsilon$ such that $2\varepsilon < d^{1/\alpha}$, so that Eq. (75) holds. Combining Eq. (74) with Eq. (76), which follows from Eq. (75), gives a relation which implies that $W_{ij(k)}$ is not convergent. The same result can be obtained for the bias $B_{i(k)}$. This contradicts the assumption that $W_{ij(k)}$ is convergent to $W_{ij}$, and thus the proof is completed.
Theorem 3: The CONV layers updated by the fractional gradient descent method of Eqs. (50) and (52) are convergent to a real extreme point.
The update equations given in Eqs. (50) and (52) can be written in the same form as Eqs. (69)-(70). After that, the proof resembles that of Theorem 2.

Fig. 5: EEG signal visualizations for the Mendeley database. (A) Spatial domain: (a) Original signal (b) Noisy signal (c) Denoised signal; (B) Tchebichef moment: (d)-(f) Corresponding signals in Tchebichef domain

The metrics in Table II are calculated as the average over all the test data. It can be seen that for both the Mendeley and Bonn(Z) databases, α = 1.2 gives the best denoising results. Compared to the traditional integer order CNN with α = 1, the fractional CNN with α = 1.2 gives around 7% and 4% performance boost in denoising for the Mendeley and Bonn databases, respectively. The original, noisy and denoised EEG signals in the spatial domain are visualized on the left side of Figs. 5 and 6 for the Mendeley and Bonn(Z) databases, respectively. The corresponding signals in the Tchebichef domain are plotted on the right side. It can be observed from Figs. 5-6(d)-(f) that signals transformed into the orthogonal space using TMs exhibit sparse behaviour. These sparse signals are used as input feature vectors, which helps in accelerating the training process as the architecture now requires fewer input coefficients to work upon.

Theorem 1: If the fractional gradient descent method (59) is convergent, then it converges to a real extreme point $z^*$.
We can prove it by contradiction. Let us assume $Z_k$ converges to a different point $Z \neq z^*$, thus

$$\lim_{k\to\infty} |Z_k - Z| = 0 \quad (62)$$

Let $\varepsilon$ be any sufficiently small value. Then for any $\varepsilon$, there exists a sufficiently large value $E \in \mathbb{N}$ such that $|Z_{k-1} - Z| < \varepsilon < |z^* - Z|$ for any $k - 1 > E$. On the basis of the assumption, $|Z_k - Z| < \varepsilon$. Then a corresponding condition on $\inf_{k>E} \frac{\partial L}{\partial Z_k}$ must hold.
In the proof of Theorem 2, let us assume $W_{ij(k)}$ converges to a different point $W_{ij} \neq W^*_{ij}$, where $W^*_{ij}$ is the real extreme point. Thus

$$\lim_{k\to\infty} |W_{ij(k)} - W_{ij}| = 0 \quad (71)$$

Let $\varepsilon$ be any sufficiently small value. Then for any $\varepsilon$, there exists a sufficiently large value $E \in \mathbb{N}$ such that $|W_{ij(k-1)} - W_{ij}| < \varepsilon < |W^*_{ij} - W_{ij}|$ for any $k - 1 > E$. On the basis of the assumption, $|W_{ij(k-1)} - W_{ij}| < \varepsilon$. Then the following equation must hold:

$$\inf_{k>E} \frac{\partial L}{\partial W_{ij(k)}} > 0 \quad (72)$$

[Figure: block diagram of the proposed architecture. Block labels: Original signal fragments (1x250), Tchebichef Transform, CONV + ReLU, 1D Average Pooling, 1D Up-sampling, Flatten, Fully Connected layer, Reshape, Inverse Tchebichef Transform, Denoised signal fragments (1x250), Tchebichef moments of original fragments, Tchebichef moments of denoised fragments]
The Mendeley database contains clean EEG recordings of 40 subjects, each having 19 channels and sampled at 200 Hz. The epileptic Bonn database contains five different sets, each representing a particular subject. The subjects Z and O contain EEG recordings of five healthy subjects with eyes open and closed, respectively, subjects N and F contain inter-ictal recordings from seizure patients, and S represents the seizure EEG signals. These EEG signals are sampled at 173.61 Hz. For creating MA-contaminated EEG signals, we use the Examples of Electromyograms database, which has clean EMG signals recorded from healthy subjects and from patients with myopathy and neuropathy. The noisy signals are created by randomly mixing the clean EEG signals with EMG signals after re-sampling at 200 Hz.

TABLE I
COMPARISON OF PROPOSED ARCHITECTURE WITH EXISTING MA REMOVAL METHODS ON MENDELEY AND BONN(Z) DATABASE

TABLE III
LAYER-WISE OPTIMIZED RANKS AFTER TRAINING THE PROPOSED ARCHITECTURE WITH RSVD COMPRESSION FOR MENDELEY AND BONN(Z) DATABASE