A Deep Neural Network-Based Spike Sorting With Improved Channel Selection and Artefact Removal

In order to implement highly efficient brain-machine interface (BMI) systems, high-channel count sensing is often used to record extracellular action potentials. However, the extracellular recordings are typically severely contaminated by artefacts and various noise sources, rendering the separation of multi-unit neural recordings an immensely challenging task. Removing artefacts and noise from neural events can improve the spike sorting performance and classification accuracy. This paper presents a deep learning technique called deep spike detection (DSD) with a strong learning ability of high-dimensional vectors for neural channel selection and artefact removal from the selected neural channels. The proposed method significantly improves spike detection compared to the conventional methods by sequentially diminishing the noise level and discarding the active artefacts in the recording channels. The simulated and experimental results show that there is considerably better performance when the extracellular raw recordings are cleaned prior to assigning individual spikes to the neurons that generated them. The DSD achieves an overall classification accuracy of 91.53% and outperforms Wave_clus by 3.38% on the simulated dataset with various noise levels and artefacts.


I. INTRODUCTION
Extracellular recordings have been widely used to monitor neuronal activity by implanting multi-electrodes in the cortex and capturing multi-dimensional neural data. The captured data are a mixture of neuronal activities. A processing step, known as spike sorting, is necessary to separate the multi-unit activities and assign the captured spikes to their originating neurons [1], [2]. Spike sorting is an invaluable research tool applied in brain-machine interface (BMI) research for studying and decoding neural signals arising from the implanted electrodes and understanding the mechanisms of the brain [3], [4]. It is also essential for deciphering intentions from brain activity in BMIs [5] and improving patient control of prostheses via devices [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
Due to the negative impacts of numerous noise sources [8] and captured artefacts on the recorded neural data, the performance of the spike sorting processing pipeline is often degraded [9]. In multiple channel recordings with microelectrode arrays (MEAs) [10], [11], a considerable number of channels record pure noise activities [12], whereas others record a substantial amount of noise with neural events and artefacts. Neurons far from the electrode tips are seen as artefacts, which represent spurious neural events from distant active neurons and have a significant negative impact on the recorded signal and consequently on spike sorting performance.
Spike sorting is more challenging than clustering in most other domains due to the above-mentioned drawbacks. It is difficult to extract the underlying and discriminative spike features for assigning individual spikes to their originating neurons from combinations of multiple complex signals and variations in waveform shapes. Attempts have previously been made to eliminate common-noise artefacts in recordings [13], [14], [15], [16], [17], [18]. There is no simple noise model [19], [20].
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In most applications, a clean signal is necessary to decipher the neural recordings accurately. Therefore, improving the performance and accuracy of the spike sorting algorithm requires refining the raw data. However, identifying channels containing active neurons and recognizing the authentic spike waveforms from a stream of spikes is very challenging [21].
Deep learning-based algorithms have proven to be highly effective in a variety of domains and applications, including brain decoding and classification. For example, in [22], the principal component analysis network (PCANet) deep learning technique is used to simplify the computationally intensive nature of traditional spike sorting algorithms. In [21], an approach based on deep learning was used to extract online invasive BMI application feature vectors.
In [23], MEA recordings of different action potentials were classified using a supervised deep learning model comprised of long short-term memory (LSTM) and a convolutional neural network (CNN). An efficient unsupervised form of deep learning-based spike sorting with manual labeling is presented in [24]. In [25], a CNN was employed to distinguish between categories of spikes. An L2-normalized deep convolutional autoencoder (CAE) with spike sorting-aware loss was exploited for feature extraction for fully unsupervised and online spike sorting [26]. In [27], a supervised deep learning model was used to distinguish spike events from non-neural events, and deep learning was used for offline spike sorting in [28].
Deep learning-based algorithms are also utilized to denoise neural signals effectively. In [29], a deep encoder-decoder network was used to denoise two-dimensional (2D) images of neuronal physiology, resulting in a significant improvement in the distribution of the signal-to-noise ratio (SNR). CAEs have been shown to be superior to the conventional band-pass technique in electroencephalogram (EEG) filtering [30]. In [31], CNN and LSTM networks outperformed conventional wavelet approaches in denoising electrocardiograms (ECG). In [32], an EEG denoising model for raw waveform-based data using a residual convolutional neural network (1D-ResCNN) was introduced and utilized, in an end-to-end manner, to map a noisy EEG signal to a clean EEG signal, enhancing the SNR significantly and producing cleaner waveforms. Yang et al. [33] removed ocular artefacts (OAs) from EEG recordings utilizing a deep learning network (DLN). The method has both an online and an offline component. After removing OAs from the training data, a DLN was trained offline to reconstruct the EEG signals. During the online phase, the DLN model acted as a filter to automatically remove OAs from the corrupted EEG data. In [34], a MATLAB-based open-source toolbox is presented that uses machine learning strategies based on neural networks to label and train models for detecting artefacts in invasive neuronal signals. The authors of [35] use a CNN to select channels, after which multiple classes of motor imagery intentions are decoded.
FIGURE 1. Diagram depicting different stages of the proposed spike sorting algorithm, including the deep spike detection (DSD). The end-to-end method consists of three main components: first, the DSD, which consists of two phases for both the accurate selection of neural channels and artefact removal from the selected channels, yielding cleaned neural data for the feature extraction stage; and subsequently, classification of the extracted features utilizing K-means clustering.
In [36], EEG decoding and visualization utilizing deep learning were exploited. It is expected that a convolutional neural network will outperform other methods in these well-established contexts for tasks involving neural data channel selection and artefact removal.
This paper introduces a deep learning-based spike sorting method termed deep spike detection (DSD) with improved spike detection accuracy. As shown in Figure 1, it embeds two convolutional neural networks into the conventional spike processing pipeline for the selection of the active neural channels and the removal of artefacts from the selected channels. The proposed method utilizes deep two-dimensional (2D) and one-dimensional (1D) CNNs. The DSD extracts discriminative spike and artefact features from the input channel to exclusively remove artefacts from extracellular recordings. A feature vector is constructed by concatenating a batch of waveforms, each with a length equal to the number of samples, to produce the input vector. A batch of waveforms is classified as a neural channel by the channel selection if at least one waveform in the batch represents a spike. In contrast, it only qualifies a batch as an artefact if all waveforms inside the batch are artefacts. The artefact removal also determines whether a batch contains spikes and discards artefact events if so. By removing artefact channels and artefacts, the spike sorting performance increases since the classification algorithm only processes neural channels containing clean spike events. After removing the artefacts, the active neurons are sorted using principal component analysis (PCA) feature vectors and the K-means classification algorithm. The main contributions of this paper are summarized as follows:
• Embedding deep learning algorithms into the spike processing pipeline to select active neural channels and extract discriminative spike waveforms.
• Complete removal of artefacts from the selected channels utilizing an optimized complementary CNN model with deep spike waveform identification capability based on morphological characteristics.
• Enhanced feature learning for efficient hardware implementation to improve classification performance.
The rest of the paper is structured as follows: Section II describes the proposed DSD algorithm and methodology. The results are presented in Section III, followed by a discussion in Section IV. Finally, Section V makes some concluding remarks.

A. OVERVIEW OF THE ALGORITHM
The proposed algorithm comprises three processing steps, as shown in Figure 1. The first step is a two-stage DSD algorithm, which is followed by a PCA feature extractor that also reduces the dimensionality of cleaned neural events by retaining only a few features. The extracted features of events representing neural data are subsequently classified using the K-means clustering algorithm.
The DSD algorithm is a hierarchical structure consisting of two phases capable of accurately detecting spikes in extracellular recordings. The hierarchical structure is constructed by combining two designed CNN-based networks to accurately select neural channels and detect spikes in the selected neural channels, respectively. The performance of the proposed method was evaluated using both simulated and experimental datasets.

B. DATASETS
The effectiveness of the proposed method was evaluated using the simulated labeled dataset made available by [37]. The dataset consists of twenty subsets with disparate levels of noise and similarity between the spikes. Depending on the degree of similarity between the spikes, the subset data is classified into four types of difficulty, namely Easy1, Easy2, Difficult1, and Difficult2. Each difficulty type contains four noise levels, expressed as standard deviations (0.05, 0.10, 0.15, and 0.20). The datasets are available with spike times, associated labels, and degrees of overlap between spikes. Using the ground-truth spike times stated in each dataset, 48 sample points were extracted for each spike, making them appropriate for testing the proposed DSD algorithm and classification performance (see Section III-C).
In this study, both non-human primates (NHPs) and human patients' experimental datasets were used. The human experimental datasets were obtained from [38] as well as [39], [40], where micro-wires and Utah arrays were used to record from human patients. Two sets of micro-wires were inserted into the hippocampus of two human patients, and two Utah arrays, each containing 100 electrodes placed in a 10 × 10 grid, were inserted into the posterior parietal cortex of two other patients.
The NHP dataset was obtained from Collaborative Research in Computational Neuroscience (CRCNS) [41], [42]. Implanted Utah arrays were used to record data from a macaque monkey's primary visual cortex. In addition to the CRCNS dataset, publicly available datasets captured from two rhesus macaques (X and B) using single micro-electrodes were also used. All datasets contain preprocessed, labeled events of 48 samples each, since deep learning models require labeled data for supervised training and validation. Table 1 provides detailed information on the experimental datasets.
The datasets were divided into a training set and a testing set. The training set contains 70% of the data and is used to optimize the parameters and the hyperparameters of the proposed DSD algorithm. The testing set, which is made up of the remaining 30%, was used to evaluate the generalization capability of the DSD on unseen data. Generalizing well essentially means being able to learn factors inherent to the data distribution so that one can perform well on any data sampled from that distribution. The training samples of neural events are almost twice as many as the training examples of artefact events. Consequently, to avoid bias during training, a balanced dataset was created by randomly choosing an equal number of samples from each class. Furthermore, validation data were used to avoid overfitting by using early stopping criteria.
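The split and class balancing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function name `split_and_balance`, the fixed seed, and the choice to balance only the training portion are all hypothetical.

```python
import random


def split_and_balance(events, labels, train_frac=0.7, seed=0):
    """Shuffle, split into train/test indices, then subsample the training
    indices so that every class contributes the same number of examples."""
    rng = random.Random(seed)
    idx = list(range(len(events)))
    rng.shuffle(idx)
    n_train = int(round(train_frac * len(idx)))
    train_idx, test_idx = idx[:n_train], idx[n_train:]

    # Group training indices by class label (1 = spike, 0 = artefact).
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Keep as many examples per class as the smallest class provides.
    n_keep = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_keep))
    rng.shuffle(balanced)
    return balanced, test_idx
```

With roughly twice as many neural events as artefact events, about half of the neural training examples are discarded by this step, which is the price paid for removing the class bias.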
Since labeled data are required for supervised learning and classification tasks, ground truth labeling is obtained by employing KiloSort [8]. KiloSort is a MATLAB-based offline automated spike sorting algorithm that effectively clusters multi-channel neural spike signals based on the geometric layout of the electrode array. KiloSort is optimally used with Phy, an open-source manual clustering Python library with a graphical interface designed to improve the manual refinement of automatic spike sorting. Adjustments are made, which are mainly focused on channel selection attributes and the sensitivity of neuron identification. These are based on each channel's visualization of the overlaid waveform clusters to assess how many clusters are present, the interspike interval distribution of any potential single spike, and whether each potential single unit is stationary (i.e., present for the majority of the recording session). Phy is only used to manually curate KiloSort's automatic spike sorting: merging (or splitting) to combine (or divide) two clusters in order to unify spikes from the same neuron or separate spikes from distinct neurons, and labeling. Clusters that contain only artefact events are labelled as ''artefact'', while spike events are labelled as ''spike.''
FIGURE 2. The CNN model architecture utilized for channel selection. The neural data is successively processed to identify the neural and artefact channels. Input is convolved using trained kernel functions in order to get informative feature maps. The convolved feature maps are downsampled by the pooling layer, and the output decision is generated by the Softmax classifier. The corresponding sizes are annotated.

C. CHANNEL SELECTION
Channel selection is the first phase of the DSD. In this paper, the channel selection phase is a deep learning algorithm that uses the CNN [43] algorithm to track and select neural data recording channels while discarding artefact channels. The number of channels varies according to the probe used to record extracellular recordings. The channel selection network architecture is illustrated in Figure 2.
Given a dataset of neural signals separated into labeled trials, $D_i = \{(x_1, y_1), \dots, (x_{N_i}, y_{N_i})\}$, where $N_i$ denotes the total number of recorded labeled trials, $x_i$ represents the $i$th feature vector, and $y_i$ the corresponding class label, the goal is to train a neural network on $D_i$ such that the output of the parametric decoder assigns the correct label $y_i$ to each raw data feature vector $x_i$ by learning the parameters $\theta$ iteratively from the training data; i.e., the $i$th sample is a neural channel if $y_i = 1$ and an artefact channel if $y_i = 0$. Here $\theta$ represents the parameters of the channel selection CNN, the loss is measured by the cross-entropy, and $b$ and $w$ are the height and width of the input, respectively. Since the channel selection CNN operates on batches of data, the batch size $b$ used is 20.
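For concreteness, the learning problem stated above can be written in the standard empirical-risk form. The decoder symbol $f$, the summation index $j$, and the binary form of the cross-entropy are notational assumptions introduced here for illustration; they are not taken from the original equations.

```latex
\theta^{*} = \arg\min_{\theta} \frac{1}{N_i} \sum_{j=1}^{N_i}
             \ell\bigl(y_j, f(x_j; \theta)\bigr),
\qquad
\ell(y, p) = -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr]
```

Here $f(x_j; \theta)$ is read as the predicted probability that feature vector $x_j$ comes from a neural channel.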
In this study, the complete dataset consists of 32,235,500 labeled waveforms that yield 1,611,775 labeled input feature vectors. The input is 48 × 20 neural data in 2D matrix form, denoted as $x$ above, where 48 is the number of samples in a waveform segment spanning 1.6 ms. The construction of a single feature vector, denoted by $x$, is accomplished by concatenating a batch of $b$ waveforms of length $w$, which ultimately results in a feature vector $x_i \in \mathbb{R}^{b \times w}$.
The feature vector $x$ is labeled a neural channel if at least one of the concatenated events is identified as a spike activity. Alternatively, $x$ is labeled as an artefact channel if all the concatenated events constitute artefact activities (Figure 2). The goal is not to determine how many or which waveforms in the batch contain spikes, but rather to determine whether a batch of waveforms is likely to contain a spike event.
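The batching and labeling rule above can be illustrated with a short sketch. The function name `build_feature_vectors` is hypothetical, and dropping an incomplete trailing batch is an assumption, since the text does not say how leftover waveforms are handled.

```python
def build_feature_vectors(waveforms, labels, batch_size=20):
    """Group consecutive waveforms from one channel into labeled batches.

    waveforms: list of equal-length sample lists (48 samples each here).
    labels:    per-waveform labels, 1 = spike, 0 = artefact.
    Returns a list of (batch, channel_label) pairs. An incomplete trailing
    batch is dropped (an assumption, not stated in the text).
    """
    vectors = []
    for start in range(0, len(waveforms) - batch_size + 1, batch_size):
        batch = waveforms[start:start + batch_size]  # b x w matrix
        batch_labels = labels[start:start + batch_size]
        # Neural channel if at least one waveform is a spike;
        # artefact channel only if every waveform is an artefact.
        channel_label = 1 if any(lab == 1 for lab in batch_labels) else 0
        vectors.append((batch, channel_label))
    return vectors
```

Because a single spike is enough to mark the batch as neural, the rule is deliberately asymmetric: it favours keeping a channel over discarding one.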
The designed channel selection CNN model is shown in Figure 2. It takes each feature vector $x$ as described above and processes it through a neural network architecture that consists of four convolutional layers, three pooling layers, a fully connected layer with one hidden layer, and a softmax classifier layer with neural channel and artefact channel as outputs. The details of the CNN layers used in channel selection are also annotated in Figure 2.
During forward propagation, the convolution operation is performed by sliding each filter across the width and height of the input 2D data $x$ with a stride of 1. This yields 2D convolved feature maps, which are passed through a nonlinear activation for additional processing. The activation function employed is the rectified linear unit (ReLU), $f(x) = \max(x, 0)$. Two sequential convolutional layers (conv1 and conv2) with a kernel size of 3 × 1 and strides of 1 × 1 extract the temporal and spatial features, respectively. The subsequent convolutional layers (conv3 and conv4) employ time-based convolution. The final layers are fully connected (FC). The first two FC layers include 500 and 100 neurons, respectively, and the last layer is fully connected to two outputs, i.e., neural or artefact channel. The kernel sizes were selected in accordance with the spatial extent of single-unit activities. For an illustration of how many and what size filters make up each convolutional layer, see Figure 2.
With the exception of the first convolutional layer, a pooling layer followed each convolutional layer. Max pooling layers are added to reduce the dimensionality and extract the most complex features from the convolved feature map. Prior to executing downsampling, zero-padding of convolutional layer 3 and convolutional layer 4 was chosen across the width without data change.
Batch normalization is chosen as a regularization technique to improve classification accuracy. The normalization method standardizes the input to each layer of the CNN, for each batch of data, to a mean value of 0 and a variance of 1. Dropout [44] is another regularization method, in which some input neurons' values are randomly set to zero. Finally, the cross-entropy cost function is modified by including an L2 regularization term, which keeps all weight parameters small to prevent a single weight parameter from dominating the classification decision.
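As an illustration of the modified cost, the sketch below adds an L2 penalty to the mean binary cross-entropy. The factor of one half in the penalty and the parameter name `lam` are common conventions assumed here; the exact form used in this work is not given in the text.

```python
import math


def regularized_cross_entropy(probs, targets, weights, lam):
    """Mean binary cross-entropy plus an L2 penalty on the weights.

    probs:   predicted probability of the positive class per example.
    targets: true labels (0 or 1).
    weights: flat list of weight values (biases conventionally excluded).
    lam:     regularization coefficient (hypothetical name).
    """
    eps = 1e-12  # guards against log(0)
    ce = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        ce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    ce /= len(probs)
    # Assumed convention: penalty = (lam / 2) * sum of squared weights.
    penalty = 0.5 * lam * sum(w * w for w in weights)
    return ce + penalty
```

Larger values of `lam` push the optimizer toward smaller weights at the expense of a slightly higher data-fit term.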
The optimization problem is solved using mini-batch stochastic gradient descent with momentum (Mini-batch SGDM), a popular optimization technique for updating the biases and weights. For the model learning, Mini-batch SGDM was used with a learning rate starting at 0.1 and tuned piecewise, decreasing by a factor of 10 every 5 training epochs. The momentum was chosen to be 0.90. A grid search over the L2 regularization coefficient was performed from 0 to 5 with a step size of 0.2, and the optimal value was found to be 1.8. An early stopping criterion was employed to prevent overfitting by monitoring the validation error at each epoch. If the error increases or remains unchanged over six successive epochs, the training is terminated. Finally, dropout regularization was used to prevent overfitting by discarding each input neuron with a probability of 0.5. The trained network assigns one of two labels, neural or artefact, to an input feature vector.
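The learning-rate schedule and the early stopping rule can be sketched as below. Both function names are hypothetical, epochs are counted from zero, and "increases or remains unchanged" is interpreted as "fails to improve on the best validation error so far", which is an assumption about the criterion.

```python
def learning_rate(epoch, initial=0.1, drop_factor=0.1, drop_period=5):
    """Piecewise-constant schedule: divide the rate by 10 every 5 epochs.

    `epoch` is counted from 0.
    """
    return initial * drop_factor ** (epoch // drop_period)


def stop_epoch(val_errors, patience=6):
    """Return the index of the epoch at which training would stop, or None.

    Training stops once the validation error has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None
```

A patience of six epochs spans more than one learning-rate step, so a plateau that a rate drop could resolve does not by itself end training.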

D. ARTEFACT REMOVAL
The channel selection phase identifies the channels that record neural events. Additionally, these channels record a substantial amount of noise and artefacts. It is impossible to get rid of every possible source of artefacts from recordings. Therefore, it is absolutely necessary to distinguish neural spike events from other types of events, such as artefacts, in order to achieve the highest possible level of data quality and unit yield in the neural channels of choice. Neurons far from the electrode tips are regarded as artefacts [45], which represent spurious neural events from distantly active neurons and have a significant negative impact on the recorded signal. Individual microelectrodes can become rather close to one another in arrays or bundles of microwires. Since the volume of recordings on different channels can overlap, it is possible to capture the same unit on more than one channel simultaneously, resulting in duplicate spike events, a source of artefacts that is often overlooked. In addition, electrical interference, cable and head movement, broken cables, etc. within the recording setup could all result in spurious spike events [46]. All of these technical glitches originate from sources other than neurons, and can produce artefacts that look very similar to spikes of neurons, and can be recorded on several channels simultaneously. Importantly, an artefact in this context is defined as any waveform that did not exceed the threshold of the network setting in this work. Some templates of the artefacts captured in this work are shown in Figure 3. As was demonstrated in the preceding paragraph, corrupted segments are detrimental to any analysis and therefore must be identified and dealt with in order to produce reliable results. This includes artefact removal or discarding the segment. By analyzing the recorded data closely, it is found that artefacts across channels frequently share very similar event shapes.
The initial assumption is that the electrode arrays record a variety of spike waveforms, either recognized as authentic spikes or spurious (or artefact) waveforms. By implementing artefact removal, a reversing process is introduced to remove spurious waveforms from the active neural channels that have an adverse impact on the overall classification performance. The labeled training data in Section II-B was used to achieve this goal of designing supervised learning. The architecture of the artefact removal is shown in Figure 4. The temporal characteristics of the spike waveform have a significant effect on accuracy. For this reason, a 1D CNN was trained on the temporal pattern.
The architecture must be able to identify and extract the most crucial abstract features and not be limited to specific feature categories. The artefact removal CNN model receives 48 samples as input. The 1D CNN utilizes three convolutional layers and two pooling layers followed by a fully connected layer to process each data segment, and the softmax function is employed in the final layer, which outputs a value denoting the probability of a spike or artefact.
Unique features are extracted in the convolutional layer. Each filter is convolved across the width of the input segment, sliding with a stride of 1, to extract the most abstract features that differentiate the spike waveforms from artefacts. This produces 1D convolved feature maps. Smaller strides result in more comprehensive and dense feature extraction and do not exclude an excessive amount of information. Consequently, the value of the stride was fixed at 1. Using a smaller convolution kernel allows for more feature extraction with less data than when using a larger convolution kernel. Nonlinearity is then introduced by an activation layer employing the ReLU, $f(x) = \max(x, 0)$. Max-pooling is also utilized to eliminate superfluous data and reduce computational complexity.
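A toy version of one convolution, activation, and pooling stage on a single waveform is sketched below. The kernel values, the absence of padding, and the pooling window of 2 are illustrative assumptions; the trained filters and layer sizes of the actual model are those annotated in Figure 4.

```python
def conv1d(signal, kernel, stride=1):
    """Valid (no padding) 1D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Rectified linear unit, f(x) = max(x, 0), applied element-wise."""
    return [max(v, 0.0) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]
```

With a stride of 1 and a kernel of length 3, a 48-sample waveform yields a 46-sample feature map, so almost no temporal resolution is lost before pooling.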
The cross-entropy cost function, augmented with an L2 regularization term, is minimized. In addition, batch normalization is utilized to standardize the intermediate outputs of the model to a zero mean and unit variance for each mini-batch of training inputs. The parameters are optimized using Mini-batch SGDM, as described in Section II-C.
The softmax function was deployed in the final network layer. The softmax classifier is used to predict whether the input data is a spike or an artefact. The final connected layer has two output dimensions, i.e., spike or artefact.
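The final decision step can be illustrated with a numerically stable softmax over two class scores. The function names and the class ordering (index 0 for artefact, index 1 for spike, mirroring the binary labels) are assumptions made for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, class_names=("artefact", "spike")):
    """Return the most probable class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]
```

Returning the probability alongside the label allows a stricter acceptance threshold than 0.5 to be applied downstream if fewer false spikes are desired.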

E. FEATURE EXTRACTION AND CLUSTERING
The essential part of this study is to show that removing artefact events improves the spike sorting performance even with the simplest clustering algorithm. Typically, neural data generated too far from the recording electrodes' tips and artificial events are regarded as artefacts. Consequently, it is difficult for any classification algorithm to distinguish them as distinct clusters. However, the proposed DSD algorithm notably isolates the artefacts. After removing artefacts, clustering is therefore a trivial task.
PCA is used as a feature extractor in this work on the spike data. Using the eigenvectors with the most variability in the acquired data, the high-dimensional spike events are projected onto 2D or 3D principal component space using PCA. The feature vectors are Z-normalized to ensure that all features have a mean of zero and a standard deviation of one.
The first stage is to transform the spike waveform data into fewer dimensions by extracting the most crucial features that differentiate the waveforms, thus forming clusters. Using this method, meaningful features were successfully extracted even from signals with high structural similarity. A fixed low-dimensional PCA projection does not always capture enough discriminatory power. Therefore, in this study, a criterion that keeps a certain amount of variability in the presented data intact while constructing low-dimensional feature vectors is applied. The number of principal components (features) is selected so that 85% of the variability of the data is retained. This criterion resulted in at most seven or eight principal components.
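The 85% criterion amounts to choosing the smallest number of leading components whose cumulative explained variance reaches the target. A minimal sketch follows, assuming the eigenvalues of the covariance matrix of the spike waveforms are already available; the function name is hypothetical.

```python
def components_for_variance(eigenvalues, target=0.85):
    """Smallest number of leading principal components whose cumulative
    explained variance reaches `target` (a fraction of the total)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)
```

For example, eigenvalues of 5, 3, 1, 0.5, and 0.5 give cumulative fractions of 0.5, 0.8, and 0.9, so three components are kept.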
The K-means algorithm is then used to classify the features with a squared Euclidean distance metric to initialize the centres of the specified number of clusters. The maximum number of predicted clusters in this study is three. For near-optimal clustering analysis, the K-means tool in MATLAB was utilized with 10 iterations.
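A compact K-means with a squared Euclidean distance metric is sketched below for illustration. It is not the MATLAB tool used in this work. Whether "10 iterations" refers to restarts or to update steps is not stated, so the sketch treats it as 10 random restarts (`n_init`), which is an assumption, as are the random initialization from data points and the fixed seed.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_init=10, max_iter=100, seed=0):
    """Lloyd's algorithm with squared Euclidean distance.

    Runs `n_init` random initialisations and keeps the partition with the
    lowest within-cluster sum of squares.
    """
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _restart in range(n_init):
        centres = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _step in range(max_iter):
            new_labels = [
                min(range(k), key=lambda c: sq_dist(p, centres[c]))
                for p in points
            ]
            if new_labels == labels:
                break  # assignments stable: converged
            labels = new_labels
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:  # keep the old centre if a cluster empties
                    centres[c] = [
                        sum(col) / len(members) for col in zip(*members)
                    ]
        cost = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost
```

In the pipeline described here, `points` would be the Z-normalized PCA feature vectors of the cleaned spike events and `k` would be at most three.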

III. RESULTS
In this paper, three metrics were used to evaluate the sorting performance. Precision and recall are reported for channel selection and artefact removal, while accuracy is used for clustering performance.
Precision (P) is determined by dividing the number of true positive spikes (TPS) by the total number of all detected spikes, which corresponds to the sum of TPS and false positive spikes (FPS) due to artefacts and overlapping perturbations. Mathematically, it is expressed as

$$P = \frac{TPS}{TPS + FPS}$$

The recall (R) is calculated by dividing the number of true positive spikes (TPS) by the total number of spiking activities, i.e., the sum of TPS and the false negative spikes (FNS), which comprises the spikes that were detected and those that were incorrectly labeled as artefacts. R is calculated using

$$R = \frac{TPS}{TPS + FNS}$$

The classification accuracy (CAcc) is the ratio of the sum of the true positives for each cluster to the total number of activities for each cluster (i.e., a generalization of recall), expressed as

$$CAcc = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} T_i}$$

where $N$ is the number of clusters, $TP_i$ is the number of true positives in cluster $i$, and $T_i$ is the total number of activities in cluster $i$.
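The three metrics follow directly from their verbal definitions and can be computed as in the sketch below. The function names and the example counts in the note are illustrative, not results from this work.

```python
def precision(tps, fps):
    """True positive spikes over all detected spikes."""
    return tps / (tps + fps)


def recall(tps, fns):
    """True positive spikes over all true spikes (detected or missed)."""
    return tps / (tps + fns)


def classification_accuracy(true_positives, totals):
    """Sum of per-cluster true positives over the total number of events.

    true_positives: correctly assigned events per cluster.
    totals:         number of events belonging to each cluster.
    """
    return sum(true_positives) / sum(totals)
```

For instance, 90 true positive spikes with 10 false positives give a precision of 0.9, and the same 90 with 30 false negatives give a recall of 0.75.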
The algorithm was implemented in MATLAB R2021b on a Windows personal computer with an Intel(R) Core(TM) i5-10310U CPU running at 2.21 GHz, 16 GB of RAM, and an Intel(R) UHD GPU. The models were written and trained in MATLAB R2021b on a server with 2 × AMD EPYC 7543 32-core CPUs, 512 GB of RAM, and 4 × NVIDIA Tesla A100 80 GB GPUs.

A. CLASSIFICATION ACCURACY OF CHANNEL SELECTION
The channel selection CNN assigns one of two labels, as neural or artefact, to a feature vector consisting of 20 waveforms. The trained model of the channel selection algorithm was evaluated on the test data. It performs well in classifying the feature vectors across all subjects, with few false negatives. The model has wrongly classified only four channels out of seven hundred and ninety-two channels, as shown in Table 2, which shows that the model has great generalization capability. Its performance was consistently evaluated on the individual channels to evaluate the effectiveness of the model across all channels.
In the recording sessions with the micro-wire implanted device, seven of the channels did not record any data. From the remaining nine, eight could record neural activity. All feature vectors of channel 1 and channels 3-9 were labeled as neural channels. Channel 2 did not record any neural data and was therefore labeled an artefact channel. The algorithm detected both neural and artefact channels with more than 95% accuracy in each channel. With an average accuracy of 99.5%, the channel selection correctly selects the channels recording neural activity.
In order to train the CNN model to achieve the best classification performance, hyperparameters need to be optimized. The batch size is one of the main hyperparameters that need to be tuned. To evaluate the impact of batch size and learning rate on classification performance, several experiments with different parameter sets were performed. First, the impact of varying the learning rate while the batch size remains unchanged was examined. Table 3 depicts the detailed results. Conversely, experiments with varying batch sizes were compared while maintaining the same learning rate. Table 4 displays the results. As demonstrated, batch size and learning rate are interdependent and can have a significant effect on classification performance. As shown in Table 4, the accuracy is comparable for the same batch size. During most training trials, with the same learning rate of 0.001, the classification accuracy and batch accumulation time increase with increasing batch size (Figure 5a). The average time it takes for all channels to accumulate waveforms is depicted in Figure 5b.
It is essential to optimize the trade-off between accurate classification and batch size selection. When the batch size was set at 20, the classification accuracy was 97.5%. With increasing iteration steps and a batch size of 65 prior to saturation, the classification accuracy was 99.5%. Comparing the two plots reveals that the channel selection CNN requires 270 ms to construct a feature vector with a batch size of 10. This yields an acceptable classification rate of 97%. The accuracy of batch accumulation is a significant factor in online decoding. Therefore, in order to perform online decoding, it is essential to construct a feature vector and track neural signals from each channel. The batch size decision may be less constrained for offline spike sorting.
From the experimental analysis shown in Table 3, the best accuracy and the lowest loss were achieved with a learning rate of 0.001 and a batch size of 20. A batch size of 20 is selected for this work, although the selection of batch size is application dependent.

B. CLASSIFICATION ACCURACY OF ARTEFACT REMOVAL
The channel selection phase achieves 99.5% accuracy in identifying channels with neuronal activity, i.e., only four wrong predictions out of a total of seven hundred and ninety-two channels (see Table 2). The trained artefact removal model was evaluated on the neural channels identified by the channel selection phase to determine how robustly it estimates the likelihood that a neural waveform is a spike event or an artefact. The artefact CNN model is trained on different waveform types from a diverse set of subjects in the training set. The model was exposed to the variability in spikes previously sorted with Kilosort, a MATLAB-based application, with each waveform assigned a binary label of y_i = 1 if it is a spike and y_i = 0 if it is an artefact. The artefact removal performance remains consistent across all evaluations, with accuracy (recall) values of 88.9% and 95.4%. The overall classification accuracy (CAcc) of the artefact removal is 92.3%.
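The relationship between per-class recall and overall classification accuracy can be illustrated with a short sketch. The channel selection figures (four errors out of 792 channels) come from the text; the artefact removal counts are hypothetical, chosen only for illustration, because this section reports percentages rather than raw counts. The function names are also illustrative.

```python
def recall(true_pos, false_neg):
    """Fraction of the members of one class that were labelled correctly."""
    return true_pos / (true_pos + false_neg)


def overall_accuracy(correct, total):
    """Fraction of all predictions that were correct."""
    return correct / total


# Channel selection: 4 wrong predictions out of 792 channels (from the text).
channel_acc = overall_accuracy(792 - 4, 792)
print(f"channel selection accuracy: {channel_acc:.1%}")

# Artefact removal: HYPOTHETICAL counts, not taken from the paper.
spikes_kept, spikes_lost = 900, 100          # spike class (y_i = 1)
artefacts_removed, artefacts_kept = 950, 50  # artefact class (y_i = 0)

spike_recall = recall(spikes_kept, spikes_lost)
artefact_recall = recall(artefacts_removed, artefacts_kept)
cacc = overall_accuracy(
    spikes_kept + artefacts_removed,
    spikes_kept + spikes_lost + artefacts_removed + artefacts_kept,
)
print(f"recall: spike {spike_recall:.1%}, artefact {artefact_recall:.1%}; CAcc {cacc:.1%}")
```

With unequal class sizes, the overall accuracy is a weighted average of the two recall values, so it always lies between them.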
For visualization purposes, three different samples were selected, showing the performance of the artefact removal on three different types of recording channels. Figure 6 depicts the waveforms with predicted labels for each class and the PCA-based projection of the waveforms onto a 2D feature space. Figure 6a depicts the performance of the artefact removal on a channel where spike events and artefacts are sufficiently separated. In this case, even a simple clustering algorithm can accurately identify the two clusters.
In Figure 6b, neural events and artefacts partially overlap in PCA space. Hence, the clustering algorithms of a conventional spike sorting pipeline will struggle to automatically extract discriminative features to distinguish between the two clusters and will systematically fail when spike events and artefacts overlap. Figure 6c shows another type of channel, where neural events and artefacts overlap almost completely. It is very challenging for a clustering algorithm to clearly define the differences between the neural waveforms of the two clusters because of the overlapping artefacts. The artefact removal identified the spike events for each neuron across all channels and discarded the artefacts, which simplifies the classification process, as shown in Figure 6c. Despite increasing artefacts and overlapping waveforms, as seen in the output waveforms (spike and artefact) in PCA space, the artefact removal in this work successfully isolates the overlapping clusters, as shown in the mean waveform (spike and artefact) plots. The results reveal that the artefact removal algorithm shows robust learning capability and, as a pre-processing step, improves the performance of the clustering algorithm.

C. CLASSIFICATION ACCURACY OF THE CLUSTERING ALGORITHM
The channel selection in conjunction with the artefact removal nearly cleaned the neural data of artefact events in two complementary steps. To extract the most abstract features, the neural waveform data is transformed to fewer dimensions using PCA, which differentiates the waveforms forming the clusters. The extracted features were fed into the K-means algorithm to identify the neural waveforms that are associated with each cluster on a single channel (i.e., the maximum number of clusters is 3). In addition to showing good reliability and robustness under varying noise levels, the results demonstrate the proposed method's ability to extract features that cluster well and are easily separated from one another. If more advanced methods for feature extraction are used, the accuracy is very likely to improve.
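The PCA and K-means stage described above can be sketched as follows. This is a minimal illustration using NumPy, not the authors' implementation: the synthetic templates, the noise level, the farthest-point initialisation and all function names are assumptions introduced for the example.

```python
import numpy as np


def pca_project(waveforms, n_components=2):
    """Project waveforms (n_spikes x n_samples) onto the leading principal components."""
    centered = waveforms - waveforms.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(features, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centroids = [features[0]]
    while len(centroids) < k:
        # Distance of every point to its closest already-chosen centroid.
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centroids], axis=0)
        centroids.append(features[np.argmax(d)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids


# Synthetic single-channel example: three HYPOTHETICAL units, 50 spikes each.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 48)


def bump(centre, width):
    """Gaussian-shaped deflection used to build toy spike templates."""
    return np.exp(-((t - centre) / width) ** 2)


templates = [
    -1.0 * bump(0.30, 0.05) + 0.3 * bump(0.50, 0.10),
    -2.0 * bump(0.30, 0.05) + 0.6 * bump(0.50, 0.10),
    -1.0 * bump(0.35, 0.10) + 0.2 * bump(0.70, 0.15),
]
waveforms = np.vstack(
    [tpl + 0.05 * rng.standard_normal((50, t.size)) for tpl in templates]
)

features = pca_project(waveforms, n_components=2)
labels, centroids = kmeans(features, k=3)  # at most 3 clusters per channel
```

In practice a library implementation such as scikit-learn's `PCA` and `KMeans` would normally replace these hand-written helpers; they are spelled out here only to keep the example self-contained.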

1) SIMULATED DATASET
To evaluate the effectiveness and classification accuracy (CAcc) of the proposed pipeline, it was compared on the simulated dataset with a conventional but powerful method, Wave_clus [47]. Table 5 and Table 6 display the outcomes for the data with varying degrees of noise. Table 5 shows the number of false negative spikes (FNS) and false positive spikes (FPS) compared with Wave_clus. The DSD gives high accuracy both when overlapping spikes (shown in parentheses) are included and when they are excluded. In a few cases, the method described in this paper performed suboptimally. Table 6 shows the classification accuracy in comparison with Wave_clus. The proposed method demonstrated an accuracy greater than 93%, in contrast to the Wave_clus spike sorting algorithm, which demonstrated an overall classification accuracy of less than 89% when the level of noise was greater than 0.3 in the Easy1 sub-dataset. Moreover, the pipeline produced the best performance on Difficult2, the most difficult subset of the simulated dataset, with an average spike sorting classification accuracy of 91.48% as compared to Wave_clus's 81.89%. In most cases, the pipeline showed superior performance over Wave_clus, except in Difficult1, where Wave_clus outperformed the proposed method when the noise levels were 0.15 and 0.2. This is one of the few cases where the DSD scored suboptimally, owing to a higher number of false positives in the pipeline than in Wave_clus, as shown in Table 5. Furthermore, Wave_clus uses a powerful chain of feature extraction, dimensionality reduction and classification (discrete wavelet transform (DWT), then the Kolmogorov-Smirnov (KS) test, then superparamagnetic clustering (SPC)), whereas the DSD uses only K-means.
TABLE 6. Comparison of classification accuracy on simulated dataset, the proposed DSD algorithm versus Wave_clus [47] and SpikeDeep-classifier [28].
Overall, the average classification accuracy of the pipeline is 91.53%, an improvement of 3.38% over Wave_clus. The advantages of the proposed method over Wave_clus were apparent: across all datasets, Wave_clus deteriorated as noise levels increased. The performance of the DSD algorithm was also compared to that of SpikeDeep-classifier [28], a deep learning-based spike sorting algorithm, on datasets from [37]. As shown in Table 6, the DSD algorithm achieves roughly 1% higher accuracy than SpikeDeep-classifier and obtains a comparable number of true-positive clusters. A potential reason for this slightly higher accuracy is that the class of artefacts targeted in this work has a more negative impact on the spike sorting pipeline than the background activities targeted by SpikeDeep-classifier [28], despite the more advanced feature extraction and clustering of SpikeDeep-classifier. Furthermore, the DSD demonstrates a high level of reliability with a large dataset and functions in a truly multichannel environment. The results show that the DSD algorithm is more robust than the SpikeDeep-classifier algorithm, with slightly fewer hits but consistently fewer false-positive clusters.
The computational cost of an algorithm is an important factor for online applications using microelectrode arrays with hundreds of channels; a lower computational cost makes an algorithm more suitable as a feature extractor for BMI decoding applications. The total number of parameters in the DSD is 30% less than in SpikeDeeptector and SpikeDeep-classifier. The numbers of floating point operations (FLOPs) and multiply-and-accumulate operations (MACs) of the DSD are also 30% less than those of SpikeDeeptector and SpikeDeep-classifier. The DSD is 25% less computationally intensive than SpikeDeep-classifier and will thus be more easily implemented on a resource-constrained hardware device as a real-time spike sorting processor employing a deep neural network.
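For readers who want to reproduce this kind of comparison, parameter and MAC counts for convolutional and fully connected layers can be tallied as in the sketch below. The layer shapes are hypothetical and do not describe the DSD, SpikeDeeptector or SpikeDeep-classifier architectures; the helper names are illustrative, and counting one MAC as two FLOPs is a common convention rather than something stated in the paper.

```python
def conv2d_cost(in_ch, out_ch, kernel_h, kernel_w, out_h, out_w, bias=True):
    """Return (parameters, MACs) for one 2D convolution layer."""
    weights = in_ch * out_ch * kernel_h * kernel_w
    params = weights + (out_ch if bias else 0)
    macs = weights * out_h * out_w  # each weight is used once per output position
    return params, macs


def dense_cost(in_features, out_features, bias=True):
    """Return (parameters, MACs) for one fully connected layer."""
    weights = in_features * out_features
    return weights + (out_features if bias else 0), weights


def relative_saving(cost, reference_cost):
    """Fractional reduction of `cost` relative to `reference_cost`."""
    return 1.0 - cost / reference_cost


# HYPOTHETICAL two-layer network; these shapes are NOT the DSD architecture.
layers = [
    conv2d_cost(in_ch=1, out_ch=8, kernel_h=3, kernel_w=3, out_h=20, out_w=48),
    dense_cost(in_features=8 * 20 * 48, out_features=2),
]
total_params = sum(p for p, _ in layers)
total_macs = sum(m for _, m in layers)
total_flops = 2 * total_macs  # convention: one MAC = one multiply + one add

print(total_params, total_macs, total_flops)
```

A statement such as "30% fewer parameters" then corresponds to `relative_saving(params_a, params_b)` being about 0.30 for the two networks being compared.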

2) EXPERIMENTAL DATASET
Experimental datasets are used to evaluate the proposed DSD pipeline. As shown in Table 2, only one hundred and nine of the five hundred and seventy-six channels in the human dataset (Utah array) were predicted to be neural channels. Only a small number of channels record neural activity from multiple neural sources that are each located in different brain areas. However, only three neural sources were captured on a small number of channels. The optimal number of clusters on the recording channels is predicted using K-means clustering in conjunction with cluster accept or merge (CAOM).
The main idea is to assess the degree of similarity and difference pairwise by examining each pair of clusters that are mutually closest to one another and determining whether the two clusters should be merged. Only six channels out of one hundred and seven were predicted to have a different number of clusters. For any number of units, the overall accuracy achieved is greater than 91%. Moreover, the performance of the classification algorithm and CAOM across all individual recording sessions remains stable.
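The pairwise accept-or-merge idea can be sketched as below. The actual CAOM similarity criteria are not given in this section, so the sketch substitutes a simple assumption: two mutually closest clusters are merged when the Euclidean distance between their mean waveforms falls below a hypothetical `threshold`. The function names and the toy two-sample "waveforms" are illustrative only.

```python
import math


def mutually_closest_pairs(means):
    """Return pairs (i, j), i < j, where each cluster is the other's nearest neighbour."""
    n = len(means)
    nearest = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest.append(min(others, key=lambda j: math.dist(means[i], means[j])))
    return [(i, j) for i, j in enumerate(nearest) if i < j and nearest[j] == i]


def accept_or_merge(means, sizes, threshold):
    """Merge mutually closest clusters whose mean waveforms are closer than `threshold`."""
    means = [list(m) for m in means]
    sizes = list(sizes)
    merged = True
    while merged and len(means) > 1:
        merged = False
        for i, j in mutually_closest_pairs(means):
            if math.dist(means[i], means[j]) < threshold:
                total = sizes[i] + sizes[j]
                # Size-weighted average of the two mean waveforms.
                means[i] = [
                    (a * sizes[i] + b * sizes[j]) / total
                    for a, b in zip(means[i], means[j])
                ]
                sizes[i] = total
                del means[j], sizes[j]
                merged = True
                break  # indices changed, so recompute the pairs
    return means, sizes


# Toy mean waveforms from K-means with k = 3 (HYPOTHETICAL values).
two_units, two_sizes = accept_or_merge(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]], [10, 10, 10], threshold=1.0
)
one_unit, one_size = accept_or_merge(
    [[0.0, 0.0], [0.1, 0.0], [0.15, 0.0]], [10, 10, 10], threshold=1.0
)
three_units, three_sizes = accept_or_merge(
    [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]], [10, 10, 10], threshold=1.0
)
```

The three calls mirror the three outcomes discussed for the experimental data: three clusters reduced to two, three reduced to one, and three kept as distinct units.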
A visual inspection and evaluation of the quality of the obtained clusters are demonstrated. Figure 7 shows the classification algorithm in conjunction with CAOM. Figure 7a depicts the output of the K-means algorithm with three neural units merged into one. The CAOM output is displayed in the second row. The similarity between units is determined according to CAOM criteria. In addition, the second row of figures depicts the mean waveforms and PCA projection in 2D. Figure 7b illustrates the output that is predicted by the K-means algorithm after three neural units have been merged into two. The CAOM output is displayed in the second row. Figure 7c presents the results that were predicted by the K-means algorithm, which consists of three clusters. In this particular instance, each of the three clusters has been considered a distinct cluster.

IV. DISCUSSION
In this study, a DSD is introduced to deal with the problem of noisy channels and artefact removal. Most spike sorting studies do not consider artefact channels and artefacts, despite the fact that removing them is one way to improve data quality. However, in a setting designed for recording, it is not possible to get rid of all of the factors that could be disruptive. Consequently, it is crucial to examine the negative impacts of artefacts in neural data. Two deep neural network classifiers are trained to select channels recording neural events and separate spikes from artefacts in the selected channels in real-time extracellular electrophysiological recordings using a pool of previously spike-sorted and labelled data.
Unlike the SpikeDeeptector [27], the DSD channel selection is focused on a defined class of artefacts. It is less generic than the SpikeDeeptector and more sensitive to deep and temporal spikes. Although the purpose of this study was to enrich the potential of classic spike detection with a simple implementation of a deep learning-based spike detection algorithm, the results demonstrate that it is robust to the defined class of artefacts. It has a better classification score and is more efficient than SpikeDeeptector. The DSD CNN design has the advantage of evaluating the entire recording space by a convolutional filter convolving across the entire spatial dimension, capturing spatial features more effectively.
The DSD algorithm also differs from the SpikeDeeptector [27] method in its architecture. The DSD uses a CNN architecture to extract both spatial and temporal features from spike signals simultaneously recorded on multi-channel electrodes, while the SpikeDeeptector uses convolution only for discovering time dependencies along the spike/artefact samples and possible dependencies between different samples. Acknowledging that the temporal features of spike signals are severely degraded by artefacts, the DSD CNN design focuses on extracting the spatial features of the spike data, adapts to non-stationary data, and is therefore well suited for acute recordings. By incorporating a generalized prototype loss based on distance cross-entropy, the algorithm learned features that were closer to their own class, improving intraclass compactness and making the feature representation more discriminative.
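One common formulation of a distance-based cross-entropy with an added prototype term, taken from the prototype learning literature, is sketched below for a single sample. The exact loss used in the DSD is not specified in this section, so the squared Euclidean distance, the scale `gamma`, the weight `lam`, and the function name are all assumptions.

```python
import math


def dce_with_prototype_loss(feature, prototypes, label, gamma=1.0, lam=0.01):
    """Distance-based cross-entropy plus a compactness term for one sample.

    Class probabilities are a softmax over negative squared distances to the
    class prototypes; the extra term pulls the feature towards its own prototype.
    """
    sq_dists = [
        sum((f - p) ** 2 for f, p in zip(feature, proto)) for proto in prototypes
    ]
    logits = [-gamma * d for d in sq_dists]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    dce = -(logits[label] - log_norm)
    return dce + lam * sq_dists[label]


# HYPOTHETICAL 2D prototypes for the spike (index 0) and artefact (index 1) classes.
prototypes = [[1.0, 0.0], [-1.0, 0.0]]
loss_near = dce_with_prototype_loss([0.9, 0.0], prototypes, label=0)
loss_equidistant = dce_with_prototype_loss([0.0, 0.0], prototypes, label=0)
loss_far = dce_with_prototype_loss([-0.9, 0.0], prototypes, label=0)
```

The loss shrinks as a feature moves towards the prototype of its own class, which is the intraclass compactness effect described above.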
When a single waveform input is run through the trained network, the output is a value between 0 (likely an artefact channel) and 1 (likely a neural channel), which represents the network's output probability that the channel is a neural or an artefact channel. In contrast, SpikeDeeptector introduced a criterion to assign labels to entire channels by calculating the mode of the predicted outputs of all the feature vectors of the given channel.
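The two labelling strategies can be contrasted with a small sketch. The 0.5 decision threshold and the function names are assumptions made for illustration; the section does not state how the DSD output probability is thresholded.

```python
from statistics import mode


def label_from_probability(prob_neural, threshold=0.5):
    """DSD-style: threshold one network output (1 = neural, 0 = artefact)."""
    return 1 if prob_neural >= threshold else 0


def label_from_mode(feature_vector_predictions):
    """SpikeDeeptector-style: most frequent prediction over a channel's feature vectors."""
    return mode(feature_vector_predictions)


dsd_label = label_from_probability(0.92)         # HYPOTHETICAL network output
mode_label = label_from_mode([1, 1, 0, 1, 0])    # HYPOTHETICAL per-vector predictions
```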
In contrast to the background activity rejector (BAR) [28], the input to the DSD's artefact removal included data from channels with very distinct spike waveforms. These channels still included both artefact (labeled as 0) and spike (labeled as 1) waveforms. The training set was designed to emphasize relatively well-isolated single-unit action potential shapes while also exposing the network to a variety of spike and artefact waveforms.
The proposed DSD algorithm was most effective at finding events within clusters that were already labeled as artefacts in the dataset. The evaluation of DSD on both simulated and experimental data revealed a low false-positive rate and, consequently, a high level of specificity for the algorithm. The exact FPS rate results are contingent on a number of factors, including the percentage of artefacts, the number of channels, the clusters, and so on.
The CAOM is used to estimate the exact number of neural units on a recorded channel. In this study, the maximum number is fixed at 3. Some recording configurations result in a greater number of neuronal units per channel, but this is often not the case with permanently implanted electrodes. In such a circumstance, CAOM's hyperparameters must be retuned. Consequently, an algorithm (HDBSCAN or t-SNE) that provides a more generic solution for determining the number of clusters will be used in the future in place of CAOM or clustering.
Embedding the DSD algorithm in a conventional spike sorting pipeline displayed a good degree of sorting precision in both simulated and experimental datasets. In the simulated dataset, even with varying degrees of noise and similarity between the spike waveforms, the proposed DSD pipeline performed extremely well in classifying the spikes. Consequently, it is plausible to conclude that the proposed DSD pipeline demonstrated excellent detection and accuracy. The performance of the approach was based on two factors: the selection of neural channels and the removal of artefacts. Both phases had a considerable positive impact on performance and accuracy. In future work, the deep neural networks will incorporate CAE and unsupervised subspace learning [48] to extract discriminative features that will handle issues of computational complexity.
In multiple channel recordings with microelectrode arrays, the recorded activity may be the overlap of multi-neuron spikes, which will degrade the traditional spike sorting performance. The existence of this issue has been widely reported. Some new methods have been proposed to solve the problem of overlapped spikes [49]. However, the challenges associated with the overlapping of spikes are not yet well resolved. Future work will address this challenge and improve the DSD's classification accuracy further by utilizing deep neural networks.

V. CONCLUSION
In this study, a deep neural network-based framework to classify neuronal spikes was proposed. By embedding two convolutional neural network algorithms, for channel selection and artefact removal respectively, in the conventional pipeline, the spike sorting pipeline substantially improves the classification accuracy with enhanced detection. It improves the sorting performance by identifying the inactive recording channels and discarding the artefacts before feature extraction. The channel selection successfully selects the channels that record neuronal events with an average accuracy of 99.5%, while the artefact removal accuracy is 92.3%. The combined use of channel selection and artefact removal with K-means clustering is promising and yields classification accuracies of 87% and 91.53% on the experimental and simulated datasets, respectively, outperforming the traditional approach.