Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers

Neuromorphic vision is a bio-inspired technology that has triggered a paradigm shift in the computer-vision community and is serving as a key enabler for a multitude of applications. This technology offers significant advantages including reduced power consumption, reduced processing needs, and communication speed-ups. However, neuromorphic cameras suffer from significant amounts of measurement noise. This noise deteriorates the performance of neuromorphic event-based perception and navigation algorithms. In this paper, we propose a novel noise filtration algorithm to eliminate events which do not represent real log-intensity variations in the observed scene. We employ a Graph Neural Network (GNN)-driven transformer algorithm, called GNN-Transformer, to classify every active event pixel in the raw stream as either a real log-intensity variation or noise. Within the GNN, a message-passing framework, called EventConv, is carried out to reflect the spatiotemporal correlation among the events while preserving their asynchronous nature. We also introduce the Known-object Ground-Truth Labeling (KoGTL) approach for generating approximate ground truth labels of event streams under various illumination conditions. KoGTL is used to generate labeled datasets from experiments recorded in challenging lighting conditions. These datasets are used to train and extensively test our proposed algorithm. When tested on unseen datasets, the proposed algorithm outperforms existing methods by 8.8% in terms of filtration accuracy. Additional tests are also conducted on publicly available datasets to demonstrate the generalization capabilities of the proposed algorithm in the presence of illumination variations and different motion dynamics. Compared to existing solutions, qualitative results verified the superior capability of the proposed algorithm to eliminate noise while preserving meaningful scene events.


I. INTRODUCTION
Over the last decade, image sensor technologies have progressed rapidly, providing several alternative solutions for scene perception and navigation. The neuromorphic event-based camera, also known as the Dynamic Vision Sensor (DVS), is an asynchronous sensor that mimics the neurobiological architecture of the human retina. It has caused a paradigm shift in vision algorithms due to the way visual data is acquired and processed. Instead of capturing image frames as conventional cameras do, event-based cameras report asynchronous temporal differences in the scene and form a continuous stream of events, each generated when the log-intensity of a pixel changes, with a temporal resolution in the order of microseconds (µs). The event-based camera can overcome the limitations of conventional cameras by providing data at low latency (20 µs), high temporal resolution (>800 kHz), high dynamic range (120 dB), and no motion blur [1]. These sensors are able to operate in a wide range of challenging illumination environments (e.g., low-light conditions), while consuming an extremely low amount of power, e.g., 10-30 mW [1].
Recently, event-based cameras have been successfully employed to perform challenging tasks such as object tracking [2], object recognition [3], monitoring [4], depth estimation [5], optical flow estimation [6], high dynamic range (HDR) image reconstruction [7], segmentation [8], guidance [9], [10], and simultaneous localization and mapping (SLAM) [11]. In the literature, the performance of such event-based applications degrades in the presence of noise [1]. The noise associated with the event data generated by a DVS could be due to the lighting conditions, motion dynamics in the scene, or the sensor parameters. Extraction of meaningful event data in the presence of noise is considered a major challenge and needs further development, as mentioned in [1].
In poor lighting conditions, events corresponding to features or edges of moving objects are highly scattered and an overwhelming amount of noise is present even if optimal camera parameters are used [11], [9]. Due to the very large number of events generated by a DVS, manually identifying and filtering out noise is a challenging task, and research efforts are therefore needed, especially towards noise identification and filtration in the presence of challenging lighting variations. To date, a mathematical model that accurately describes the noise associated with event streams has not been formulated. To circumvent this challenge, machine learning approaches can be employed to approximately model and characterize the noise parameters and consequently filter out events that do not correspond to real intensity variations in the scene. However, the lack of labeled datasets to train event-denoising models has hindered the progress of machine learning solutions to this problem. In this paper, we propose the Known-object Ground-Truth Labeling (KoGTL) approach for generating approximate ground truth labels for event streams. This is directed towards developing an ML-based event denoising technique that inherently copes with the nonlinear behavior of the noise associated with events.

Fig. 1: Denoising results using the publicly available IndoorsCorridor dataset in a low-light scenario [12]. Events (yellow dots) are overlaid on the corresponding APS image for visualization. (a) Raw DVS stream of events and (b) denoised events using the proposed learning-based method (GNN-Transformer). Our GNN-Transformer performs a binary classification to distinguish between actual DVS events and noise. Note that our proposed algorithm does not use APS images for denoising. All events that do not correspond to edges but are visible in the APS image have been filtered out. Our GNN-Transformer performs significantly better than the state-of-the-art methods in challenging lighting conditions (i.e., low light).

Graph neural networks (GNNs) have shown excellent progress in a plethora of applications [13], [14]. A GNN operates on data structures in the non-Euclidean domain and hence is considered part of the geometric deep learning framework. In particular, GNNs operate on graphs that model a group of objects, referred to as nodes, and their relationships, referred to as edges [15]. Such data structures are not supported by conventional deep neural networks (DNNs), convolutional neural networks (CNNs), or recurrent neural networks (RNNs). A GNN preserves the structure of the input graph and exploits the dependencies between the nodes to infer knowledge about the data. Hence, we exploit this feature of GNNs and propose a message-passing GNN model that can operate on event streams, preserve the asynchronous nature of events, and learn to output only the noise-free DVS events.
Recently, transformers have attracted significant attention in the machine learning community [16]. Vaswani et al. proposed to model sequence-to-sequence learning tasks using the transformer [17]. The self-attention mechanism within the transformer captures the relationships between input and output data and, unlike recurrent networks, supports parallel processing of sequences. Transformers have recently been employed in many applications, including natural language processing and computer vision, to name a few [18], [16], [19]. In this work, we employ transformers within the proposed GNN for the task of identifying and eliminating the noise associated with events generated by a DVS. To the best of our knowledge, no research study exists in the literature where GNNs are employed together with transformers for event-based applications.
We propose a novel event denoising (ED) model that can learn spatiotemporal correlations between newly arrived events and the previous active events in the same neighborhood. This is achieved by means of a GNN-Transformer that operates on event streams encoded into graph structures. Our proposed algorithm consists of a message-passing GNN model and a transformer network to perform binary classification of events into real activity events or noise. The proposed GNN-Transformer based ED algorithm has the following advantages: (I) it can seamlessly operate on raw event streams without any data preprocessing or tuning of camera parameters, (II) it can perform efficiently in illumination conditions ranging from good light to near darkness, and (III) it shows robustness against different motion dynamics. The proposed GNN-Transformer is an accurate and general learning-based spatiotemporal event filter that outperforms existing denoising methods [20], [21], [22], [23], [24] in various testing scenarios. Through several tests on publicly available datasets [12], the proposed model has proven its effectiveness and capability to denoise incoming streams of events under challenging conditions in terms of illumination and motion dynamics. Fig. 1 shows sample denoising results obtained when our proposed algorithm was used on a publicly available dataset recorded in low-light conditions [12].

Our proposed algorithm operates on event graphs constructed from the incoming raw event streams, where nodes represent the event properties (pixel location and time of arrival). The node of interest, i.e., the event that has just been observed, is connected through edges to the rest of the nodes that represent recent activity in the neighborhood. Then, node features are processed to generate seven messages that are sent out along the graph edges in preparation for inference and event classification. Messages are then aggregated to form a graph signature, based on which the node of interest is classified as a real-activity event or noise. Since classification is done based on the graph signature rather than the raw node features, the proposed algorithm achieves generalization across various testing datasets.
To train and test the proposed model, we develop an experimental protocol to acquire event streams from motion in different directions and under various lighting conditions. The proposed KoGTL approach is used to label events as real activity events (class 1) or noise (class 0). The training dataset is then constructed using graph samples that encode event features and neighborhood properties, and their corresponding labels generated using KoGTL. It is worth noting that the proposed algorithm accepts input graphs of variable sizes, i.e., a varying number of events in a particular spatiotemporal neighborhood. This property of the proposed ED method is crucial since it allows for coping with the asynchronous nature of event acquisition. Experimental evaluations on various training and testing datasets demonstrate excellent performance of the proposed algorithm compared to the existing state-of-the-art methods. The main contributions of this work are as follows:
1) We introduce a novel Known-object Ground-Truth Labeling (KoGTL) approach to generate a labeled dataset of noise and real-activity events. This dataset includes varied lighting conditions and relative motions in the visual scene.
2) We design a novel message passing framework, dubbed EventConv, on graphs constructed from DVS events. Messages encapsulate the spatiotemporal properties of events in a neighborhood while accounting for the asynchronous nature of data acquisition.
3) We develop a novel Event Denoising GNN-Transformer architecture based on the novel EventConv layer to distinguish between real-activity and noise events.
4) We perform extensive evaluations of the proposed algorithm on our labeled dataset and other publicly available event datasets. Experiments are conducted to validate the proposed model's generalization capabilities on unseen data involving different motion dynamics and challenging lighting conditions.
5) We release a new dataset (ED-KoGTL) with labeled neuromorphic camera events acquired from motions in different directions and under various illumination conditions. Our labeled dataset is publicly available to the research community <https://github.com/Yusra-alkendi/ED-KoGTL> for benchmark comparison.

The rest of this paper is organized as follows. In Section II, we review related work. In Section III, we describe the proposed algorithm and dataset in detail. The experimental results are presented in Section IV. Finally, the conclusions are drawn in Section V.

II. RELATED WORK

A. Event denoising
The importance of the event denoising module to event-based computer vision algorithms has been demonstrated in several research works, such as for object recognition [25], object tracking [26], image reconstruction [26], and segmentation [27]. A DVS produces noise for various reasons. Noise can be generated by thermal noise and junction leakage currents under constant lighting conditions. This type of noise is referred to as background activity noise. False positive events, which occur when there is no change in the log intensity, also contribute noise. Furthermore, when a sudden change in illumination happens, a huge amount of random noise occurs in the event stream.
Background activity (BA) events differ from real activity events. BA lacks temporal correlation with the newly arrived events in the spatial neighborhood, while real activity events show meaningful correlation. Several event noise reduction methods have been proposed in the literature. These methods can be categorized into conventional methods [28], [23], [22], [21], [29], [30] and deep learning methods [20], [31], [26]. The most prevalent filtering approach is based on the nearest neighbor (NNb) method and hence on spatiotemporal correlation [28], [23], [22]. In such filters, the properties of the previously generated events in a spatiotemporal neighborhood are utilized to determine if a newly arrived event represents real activity. The parameters of the spatiotemporal window have to be tuned by the user. Fig. 2 shows the representation of an event spatiotemporal neighborhood, where the newly arrived event at $t_i$ is marked as a red pixel and its spatial neighborhood is shown in blue. Therefore, such approaches require additional memory resources to retain the properties of the previous and the newly arrived events for processing.

Fig. 2: An example of event spatiotemporal neighborhood.

Fig. 3: Examples of memory strategy of different spatiotemporal filters [22]: a) shows one memory cell per pixel [28], b) shows one memory cell per two sub-sampling group [23], and c) shows two memory cells for each column and row [22].
The BA filter proposed by Delbruck [28] classifies events that have fewer than eight other events in their spatiotemporal neighborhood as noise. One drawback of this approach appears when two BA events are close enough in one spatiotemporal region, in which case the filter considers them real activity events. Furthermore, Liu et al. [23] proposed a filter that tackles the problem of increased memory requirements by sub-sampling pixels into groups: instead of mapping every pixel to its own memory cell, one memory cell holds a sub-sampled group of pixels. The filtration accuracy relies heavily on the sub-sampling factor and decreases when the sub-sampling factor is greater than 2.
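The nearest-neighbor filters discussed above can be illustrated with a minimal sketch that uses one memory cell per pixel. It is not the implementation of [28]: the function name, the time window `dt_us`, and the support threshold `min_support` are assumptions introduced here for illustration only.

```python
import numpy as np

def background_activity_filter(events, height, width, dt_us=1000, min_support=1):
    """Keep an event only if enough of its 8 spatial neighbours fired recently.

    events: iterable of (x, y, t) tuples sorted by timestamp t (microseconds).
    dt_us and min_support are illustrative parameters, not values from the paper.
    Returns a list of booleans, True for events kept as real activity.
    """
    # One memory cell per pixel: timestamp of the most recent event there.
    # A 1-pixel border avoids bounds checks on the 3x3 window.
    last_ts = np.full((height + 2, width + 2), -np.inf)
    keep = []
    for x, y, t in events:
        # 3x3 patch centred on (x, y), expressed in padded coordinates.
        window = last_ts[y:y + 3, x:x + 3].copy()
        window[1, 1] = -np.inf  # ignore the pixel's own history
        support = int(np.sum(t - window <= dt_us))
        keep.append(support >= min_support)
        last_ts[y + 1, x + 1] = t  # update the memory cell of this pixel
    return keep

events = [(5, 5, 0), (6, 5, 100), (20, 20, 150), (5, 6, 5000)]
print(background_activity_filter(events, height=260, width=346))
```

In this toy stream the second event is kept only because another event occurred next to it 100 µs earlier. If both were BA events, the filter would still pass the second one, which is the drawback noted above.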
Khodamoradi and Kastner proposed another storage technique for events and their timestamps that uses less memory space [22]. In particular, the most recent event in every row and column is stored along with its corresponding polarity and timestamp in two 32-bit memory cells. Hence, if two events are acquired in the same column but in two different rows within a short temporal window, the recent event will override the old one in memory. This is a serious limitation of this approach, as establishing spatial correlation becomes impossible, and thereby more real activity events could be filtered out as noise. Fig. 3 depicts the techniques used to store events in memory prior to filtration as proposed in [28], [22], [23].
To overcome memory and computational complexity issues, Yang et al. proposed a density matrix in which each arriving event is projected into its spatiotemporal region [21]. The denoising process in this method consists of two steps: (1) removing random noise and (2) removing hot pixels (permanently active or permanently broken event pixels). Among the learning-based denoising approaches in the literature, Baldwin et al. [20], [31] and Duan et al. [26] have proposed a convolutional neural network and a U-net network, respectively, to filter DVS noise.
It is also evident that the performance of the existing denoising methods relies on tunable parameters, e.g., spatiotemporal window size, event camera settings, environmental illumination conditions, and camera motion dynamics [21], [22], [20], [31], [26]. Such parameters are application-dependent, and manually tuning them may lead to satisfactory denoising results, especially in good lighting conditions. Even when the camera parameters are set to their optimal values, however, features or edges of moving objects in very low illumination conditions are highly scattered and very noisy. To extract meaningful information under varying light conditions, a method that can reject this noise and sharpen the real event data is essential. Nevertheless, spatiotemporal correlation-based and deep learning methods of event denoising remain largely unexplored.

B. Graph Neural Networks and Transformers
Graph Neural Networks (GNNs) are deep learning models that operate on non-Euclidean data structures such as graphs. GNNs take into account the properties of each graph node and its connectivity within its neighborhood, regardless of the order in which data is provided to the neural network. It is also worth mentioning that the size of the input graph can vary for the same network, which makes GNNs very well suited for the application at hand. Owing to their expressive power and model flexibility, GNNs have recently been employed in a wide range of applications, e.g., visual understanding on images [32], [33]. Interested readers can find more details in these recent surveys [34], [35].
There are different types of graph representations exhibiting various levels of complexity (i.e., number of connections and dimension) to address the problem in question. For instance, the works proposed in [36] and [37] designed graphs to represent point clouds and ground vehicle poses, respectively. The features of the nodes and edges in each graph encode the information necessary to solve the problem at hand, such as the 3D coordinates of a point and the 2D pose of the robot. In [36], a stack of EdgeConv layers is proposed to capture and exploit fine-grained geometric properties of point clouds, which are then employed to carry out classification and segmentation of point cloud data. Another graph convolutional layer, called PoseConv, is proposed in [37] to carry out global optimality verification of 2D pose graph SLAM.
There are several types of GNNs, designed to fit different graph structures for different tasks. Our proposed algorithm adopts a message passing algorithm on graphs, which is carried out in two stages: message passing and aggregation [34]. In this work, spatiotemporal correlation functions are used to construct a graph with a unique signature that reflects the nature of the input data. This is to reflect the nonlinear nature of the noise associated with DVS event streams. In addition, the graph isomorphism problem arises when two different graphs have an identical representation after being reduced by the aggregation function. Inspired by [38], we employ a nonlinear activation within the aggregation stage to handle the graph isomorphism issue. This generates a unique graph signature that represents the spatiotemporal correlation between the nodes of the constructed graphs.
Recently, transformers have demonstrated state-of-the-art performance on a multitude of applications including natural language processing [18] and vision systems [16], [39], [40]. The self-attention head captures the relationship between inputs and outputs and, unlike sequential recurrent networks, supports parallel processing. In this paper, we demonstrate the scalability of transformers to neuromorphic vision sensors and their capability to handle the asynchronous nature of events. This is designed within the graph layer that employs a message passing algorithm to process the dynamic and varying nature of event streams. The output of the graph is then processed by the transformer, prior to the final classification stage which removes noise from the event stream.

III. PROPOSED FRAMEWORK
In this paper, a novel GNN-Transformer is proposed and trained to predict if an incoming DVS event represents noise or a real log-intensity variation in the scene. A real log-intensity variation represents a meaningful feature within the scene, e.g., the edge of an object. The overall framework of the proposed event denoising algorithm is illustrated in Fig. 4. In the following subsections, we explain each component in detail.

A. Known-object Ground-Truth Labeling (KoGTL)
The availability of labeled datasets is key to the success of supervised learning algorithms. To that end, we propose a novel offline methodology, referred to as Known-object Ground-Truth Labeling (KoGTL), which classifies a DVS event stream into two main classes: real or noise events. We use KoGTL to generate labeled datasets and train a neural network to predict whether an event represents noise or real activity in the scene.
1) Experimental setup: The main idea behind KoGTL is to use a multi-trial experimental approach to record event streams and then perform labeling. More specifically, a dynamic active pixel vision sensor (DAVIS346C) is mounted on a Universal Robot UR10 6-DOF arm [41] in a forward-facing position and repeatedly moved along an identical trajectory under various illumination conditions. The UR10 manipulator ensures a repeatability margin of 100 microns along a trajectory when it is performed repeatedly. The DAVIS346C provides a spatial resolution of 346×260, a minimum latency of 12 µs, a bandwidth of 12 MEvent/s, and a dynamic range of 120 dB [42]. The events are recorded along with two other measurements: (1) the camera pose at which the data was recorded, which we obtain through the kinematics of the robot arm, and (2) the intensity measurements of the scene obtained using the augmented active pixel sensor, which are referred to as APS images hereafter.
Four experimental scenarios are adopted, in which data is acquired from repeated translational motion of the robot along square trajectories under different lighting conditions, namely ∼750 lux, ∼350 lux, ∼5 lux, and ∼0.15 lux. Streams of events with the corresponding APS images and robot poses were acquired for about five seconds per experimental scenario. Although the camera motion is identical in all experiments and the depicted scene (APS image) does not change, the properties of the event streams vary due to changes in illumination. Two of the experimental scenarios are used for training the proposed event denoising method, while the other two are used exclusively for testing and model evaluation.
2) Labeling Framework: The proposed KoGTL labeling algorithm is divided into three main stages, namely Event-Image Synchronization, Event-Edge Fitting, and Event-Labeling, as depicted in Fig. 5.

Event-Image Synchronization: All the recorded experiments are first synchronized based on the time at which the robot arm started moving (Fig. 5-(I)). Consequently, following identical camera trajectories allows for synchronizing events and APS images across different lighting conditions. More specifically, events recorded under poor lighting conditions can be overlaid on APS images captured at the same camera pose under good lighting conditions, given that the scene is identical across all experiments. This facilitates matching events recorded in low-light conditions to alternative APS image features representing the same scene, which is crucial for the success of the second stage. This would not have been possible using the APS images captured in low-light conditions, where variations in intensities and hence features (edges) from the scene are absent.

Event-Edge Fitting: In the second stage, the Canny edge detector [43] is used to extract edges from the APS images captured along the trajectory under good lighting conditions. The events captured between two consecutive APS images ($t_{APS,i} \le t_{event} < t_{APS,i+1}$) are accumulated for every lighting scenario, forming a 2D vector as depicted in Fig. 5-(I). Using the iterative closest point (ICP) fitting technique [44], event data are fitted to their corresponding APS edge data. Fitting was done in several stages because of the high temporal resolution of DVS data acquisition. Events might deviate slightly from APS edges due to imperfections in the time-synchronization of events and APS data. Therefore, ICP is used to align them and correct any resulting spatial shift, as shown in Fig. 5-(II).
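The accumulation step of the Event-Edge Fitting stage, grouping events that fall between two consecutive APS frames, can be sketched as follows. This is a minimal illustration under assumptions: the function name and the convention of marking out-of-range events with -1 are not from the paper, and the ICP alignment itself is not shown.

```python
import numpy as np

def group_events_by_frame(event_t, frame_t):
    """Assign each event to the APS interval i with frame_t[i] <= t < frame_t[i+1].

    event_t: event timestamps; frame_t: sorted APS frame timestamps.
    Returns an integer array of interval indices; -1 marks events that fall
    before the first frame or at/after the last one (illustrative convention).
    """
    event_t = np.asarray(event_t)
    frame_t = np.asarray(frame_t)
    # Index of the last frame whose timestamp is <= the event timestamp.
    idx = np.searchsorted(frame_t, event_t, side="right") - 1
    # Events without a preceding frame or without a following frame are unassigned.
    idx[(idx < 0) | (idx >= len(frame_t) - 1)] = -1
    return idx

frame_t = [0, 100, 200]
event_t = [-5, 0, 50, 100, 199, 200, 250]
print(group_events_by_frame(event_t, frame_t))
```

Events sharing an index can then be accumulated into one 2D point set and fitted to the edge pixels of the corresponding APS frame.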
Event-Labeling: In the third stage, events that were fitted to edges in the APS images are labeled as real-activity events (Class 1), as shown in Fig. 5-(III). Other events that fall outside a spatial window around edge pixels (between +B and −B pixels) are considered noise (Class 0). For our dataset, events are classified as noise when they are more than two pixels away (i.e., B = 2) from an edge in the APS image. This window size was selected based on visual observation of the fitting results using multiple B values.

Fig. 5: KoGTL labeling framework. KoGTL is a novel DVS event labeling methodology developed to classify DVS events, acquired under various illumination conditions, into two main classes: real event or noise. The proposed KoGTL labels events that are acquired using a multi-trial experimental approach, along with two measurements: the camera pose and the intensity measurements of the scene.
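The labeling rule above can be sketched in a few lines. The sketch assumes that the edge image is already available as a boolean mask and that the spatial window is square (±B pixels in both x and y); the function name and the array layout are illustrative choices, not part of the published KoGTL code.

```python
import numpy as np

def label_events(events_xy, edge_mask, B=2):
    """Label events as 1 (real activity) if within B pixels of an edge, else 0 (noise).

    events_xy: (N, 2) integer array of (x, y) pixel coordinates.
    edge_mask: (H, W) boolean image, True at edge pixels.
    The square +/-B window is an assumption made for this sketch.
    """
    H, W = edge_mask.shape
    padded = np.pad(edge_mask.astype(bool), B)  # zero (False) padding on all sides
    near_edge = np.zeros((H, W), dtype=bool)
    # OR together all shifts within +/-B pixels (binary dilation by a square window).
    for dy in range(2 * B + 1):
        for dx in range(2 * B + 1):
            near_edge |= padded[dy:dy + H, dx:dx + W]
    events_xy = np.asarray(events_xy)
    return near_edge[events_xy[:, 1], events_xy[:, 0]].astype(int)

edge_mask = np.zeros((10, 10), dtype=bool)
edge_mask[:, 5] = True  # a vertical edge at x = 5
events_xy = [(5, 0), (7, 3), (8, 3), (2, 9), (3, 9)]
print(label_events(events_xy, edge_mask, B=2))
```

With B = 2, events at x = 3 and x = 7 are still labeled as real activity, while events at x = 2 and x = 8 are labeled as noise.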

B. Proposed GNN-Transformer Algorithm for Event Denoising
In this section, we explain the proposed GNN-Transformer for event denoising as depicted in Fig. 6. The GNN-Transformer consists of three main stages: event graph construction, message passing on graphs, and event classification.
1) Event Graph Construction: Unlike conventional image frames, event data arrives asynchronously within a spatial resolution of H × W pixels (Fig. 6-I). Every event encodes a log-intensity variation in the visual scene and is represented by a tuple $e = \langle x, y, t, p \rangle$, where (x, y) are the pixel coordinates at which the event occurred, t is the event's timestamp, and p is the event's polarity (either 1 or -1, signifying an increase or a decrease in the intensity, respectively). A sequence of events within a spatiotemporal neighborhood is referred to as a local volume. The local volume is defined in terms of its spatial (L × L) and temporal (T) dimensions around the event of interest. For example, if L = 1 and T = 1, the local volume includes the events arriving in a spatial window of 3×3 pixels around the event of interest in the previous 1 ms.
When a new event $e_i$ arrives (Fig. 6-II), a graph G that represents the local volume of the event is constructed (Fig. 6-III). The nodes of the graph are all the events in the defined local volume. Every node has three features $\langle x_j, y_j, t_j \rangle$, where j is a node in the graph, $x_j$ and $y_j$ are the pixel coordinates at which the event occurred, and $t_j$ is the event's timestamp.
In this work, we omit event polarity as a node feature because polarity is affected by the sensitivity of events to changes in scene illumination, which may vary with different camera parameters. Directed edges are added from every node in the graph to the event of interest. More specifically, all neighboring events (nodes) are connected to the newly arrived event (the node or event of interest), which is the one to be classified by the neural classifier. It is worth noting that the graph can be of variable size, i.e., every sample might include a different number of nodes. The ability of graph neural networks to handle such graphs makes our approach more flexible, since it facilitates operation on events arriving asynchronously at a variable rate.
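A minimal sketch of the graph construction described above is given below. It assumes events are stored as an (N, 3) array of (x, y, t) sorted by time, with t in microseconds; the function name, the array layout, and the choice to place the event of interest in the last row are illustrative assumptions.

```python
import numpy as np

def build_event_graph(events, i, L=1, T_us=1000):
    """Build the local-volume graph for event i.

    events: (N, 3) array with columns (x, y, t), sorted by t (microseconds).
    L: spatial half-width in pixels (L = 1 gives a 3x3 window).
    T_us: temporal extent of the local volume (1000 us = 1 ms).
    Returns (nodes, edge_index): nodes holds the neighbours followed by the
    event of interest in the last row; every edge points from a neighbour
    to the event of interest.
    """
    events = np.asarray(events, dtype=float)
    x_i, y_i, t_i = events[i]
    past = events[:i]  # only events that arrived before event i
    in_volume = (
        (np.abs(past[:, 0] - x_i) <= L)
        & (np.abs(past[:, 1] - y_i) <= L)
        & (t_i - past[:, 2] <= T_us)
    )
    nodes = np.vstack([past[in_volume], events[i]])
    target = len(nodes) - 1
    edge_index = np.array([[j, target] for j in range(target)], dtype=int).reshape(-1, 2)
    return nodes, edge_index

events = [[10, 10, 0], [11, 10, 500], [30, 30, 600], [10, 11, 1200]]
nodes, edges = build_event_graph(events, i=3)
print(nodes)
print(edges)
```

The number of rows in `nodes` changes from event to event, which is the variable graph size discussed above; an event with no recent neighbors yields a single-node graph with no edges.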
2) Message passing on Graph - EventConv Layer: After constructing the event graph, messages are exchanged along the outgoing edges, from source nodes j to the node i representing the newly arrived event in the graph. The process of computing, sending, and aggregating the messages at the receiving node i is carried out by the proposed EventConv layer. Every node constructs a message consisting of its three features and sends it to node i for further processing. After receiving all the messages, node i, which represents the newly arrived event, processes and aggregates them. More specifically, the average of each of the node features $\langle x, y, t \rangle$ across the graph is computed (Fig. 6-(1)). The average values $\bar{x}$, $\bar{y}$, and $\bar{t}$ are then used to estimate the spatiotemporal correlations among the events in the event graph G. The relationship between the event of interest and its neighboring events in space and time is encoded into seven quantities: ($Q_1$) the spatial difference in x, ($Q_2$) the spatial difference in y, ($Q_3$) the temporal difference, ($Q_4$) the standard deviation in x, ($Q_5$) the standard deviation in y, ($Q_6$) the standard deviation in t, and ($Q_7$) the Euclidean distance. The computations of these quantities are depicted in Fig. 6-(1) and denoted as ($Q_{1,L}, \ldots, Q_{7,L}$), where L represents the node index. These quantities were selected based on the results of an ablation study, as described in the following sections. Each of these quantities is passed through a linear layer followed by a sigmoid activation function prior to aggregation. Quantities of the same type across the received messages are summed up. This operation results in a 1D vector representing a unique graph signature, which is referred to as h. The uniqueness of the signature of graph G circumvents the problem of isomorphism, where two different graphs are represented by the same signature after being reduced in the aggregation stage [38]. The message passing and aggregation steps are carried out as part of the GNN, which is used in conjunction with transformers to perform classification. The steps explained above are depicted in Fig. 6-(1).
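The message computation and aggregation described above can be sketched as follows. The exact formulas are given in Fig. 6-(1), which is not reproduced here, so several details are assumptions made only for illustration: differences are taken between the event of interest and each neighbor, the per-node "standard deviation" terms are approximated by the absolute deviation from the graph mean, each linear layer is a scalar weight and bias, and the signature h has seven elements.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def event_conv(nodes, weights, biases):
    """Illustrative message passing and sum aggregation on one event graph.

    nodes: (M, 3) array of (x, y, t); the last row is the event of interest.
    weights, biases: length-7 arrays, one scalar linear layer per quantity
    (hypothetical parameterisation).
    Returns a 7-element graph signature h.
    """
    nodes = np.asarray(nodes, dtype=float)
    neighbours, target = nodes[:-1], nodes[-1]
    mean = nodes.mean(axis=0)                  # mean of x, y, t over the graph
    diff = target - neighbours                 # Q1..Q3: differences in x, y, t
    dev = np.abs(neighbours - mean)            # Q4..Q6: per-node deviation (assumed form)
    dist = np.hypot(diff[:, 0], diff[:, 1])    # Q7: Euclidean pixel distance
    Q = np.column_stack([diff, dev, dist])     # one 7-element message per neighbour
    # Linear layer + sigmoid per quantity, then sum over all received messages.
    return sigmoid(Q * weights + biases).sum(axis=0)

nodes = [[11, 10, 500], [9, 11, 900], [10, 11, 1200]]
h = event_conv(nodes, weights=np.ones(7), biases=np.zeros(7))
print(h)
```

Because the messages are summed, the signature does not depend on the order in which the neighbors are listed, and graphs with different numbers of nodes map to signatures of the same length.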
3) Proposed GNN-Transformer Classifier: The overall architecture of the proposed learning-based classifier consists of two main parts: a graph neural network and a transformer. In this section, more details about the structure selection are given. For every acquired event in the stream, a graph is constructed to reflect the spatiotemporal correlations between this event and the previous events in its neighborhood. The proposed GNN operates on these graphs and outputs a graph signature, previously referred to as h. This graph signature is passed to the transformer for further processing. More specifically, the graph signature h is mapped to another representation by the transformer network, and finally the binary classification is performed. The output of the proposed GNN-Transformer is a noise-free event stream that accurately resembles the activity in the scene.
The transformer is a sequence-to-sequence encoder-decoder network [17]. The self-attention mechanism encapsulates the interactions between all elements of a given sequence for structured prediction tasks. The attention mechanism with the Query-Key-Value (QKV) model enables the transformer to have extremely long-term memory [17], to model dependencies between input and output, and consequently to allow more parallelization. The multi-head attention layer comprises multiple stacks of self-attention. A multi-head attention mechanism encapsulates a given sequence of elements into multiple jointly complex relationships by projecting them onto three learnable weight matrices, called Query, Key, and Value. In these matrices, the computed weight distribution on the input sequence reflects the uniqueness of the graph signature by assigning higher values to more representative elements. In essence, each element of a given input sequence in the multi-head attention layer is updated by concatenating and aggregating global representative information.
Given a graph signature h with n elements ($h_1, h_2, \ldots, h_n$), the objective of self-attention is to encode the global interaction information that exists among the elements. To achieve this, three learnable weight matrices are defined: Queries ($W^Q \in \mathbb{R}^{n \times d_q}$), Keys ($W^K \in \mathbb{R}^{n \times d_k}$), and Values ($W^V \in \mathbb{R}^{n \times d_v}$), where W denotes a learnable weight matrix, n is the size of the input features in h, and $d_q$, $d_k$, and $d_v$ represent the dimensions of the query, key, and value vectors, respectively; $d_q = d_k = d_v = n$ in our model. In the first step, the input sequence h is projected onto these weight matrices to obtain $Q = hW^Q$, $K = hW^K$, and $V = hW^V$. The output of the self-attention layer, $Z \in \mathbb{R}^{n \times d_v}$, is computed as follows:

$$Z = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

The most commonly used attention functions are additive attention [45] and dot-product attention [17]. In our model, dot-product attention, which is a simple matrix multiplication, is selected to update the state within the encoder and decoder units. This makes the attention process and its computations much faster and more space-efficient. In the multi-head attention process, outputs from d self-attention units are concatenated into one vector $[Z_1, Z_2, \ldots, Z_d]$ and are then projected by an output weight matrix $W^o \in \mathbb{R}^{nd \times n}$, as follows:

$$\mathrm{MHA}(h) = [Z_1, Z_2, \ldots, Z_d]\, W^{o}$$

Furthermore, the multi-head attention transformer facilitates identification of jointly complex relationships and makes the model easier to interpret.

Transformer encoder: The architecture of the encoder and decoder layers within the transformer follows the original structure in [17], which consists of a multi-head self-attention unit and a feed-forward network. In the formulation of a single encoder unit, N denotes the number of encoder layers, MHA represents the multi-head self-attention module, LN denotes the operation of layer normalization [46], and $F_E$ denotes the output of the encoder layer. FFN is the feed-forward network, which contains two fully connected layers with a ReLU activation function in between, as in (8).
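The self-attention and multi-head projection steps described above can be sketched in plain Python. This is an illustrative sketch rather than the authors' PyTorch implementation: it uses the scaled dot-product form of [17], treats the input as a small matrix with one row per element, and the helper names (`matmul`, `self_attention`, `multi_head`) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def self_attention(h, w_q, w_k, w_v):
    """Z = softmax(Q K^T / sqrt(d_k)) V with Q = h W^Q, K = h W^K, V = h W^V."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


def multi_head(h, heads, w_o):
    """Concatenate the outputs of several attention heads, then project by W^o.

    heads: list of (w_q, w_k, w_v) tuples, one per head.
    """
    zs = [self_attention(h, *w) for w in heads]
    concat = [sum((z[i] for z in zs), []) for i in range(len(h))]
    return matmul(concat, w_o)
```

Each row of the softmax output is a weight distribution over the input elements, which is how more representative elements receive higher weights.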
Transformer decoder: The transformer decoder unit takes the encoder's outputs as inputs and has two multi-head self-attention modules (MHA) followed by a feed-forward network (FFN). In the formulation of a single decoder unit, l denotes the number of decoder layers and $F_D$ represents the output of the transformer unit ($F_D \in \mathbb{R}^{n \times 1}$), which reveals important features that uniquely represent the graph signature h.

The output of the coupled GNN-Transformer is finally passed to a fully connected layer that generates a 2 × 1 tensor for every sample in the dataset, where 2 is the number of classes: real log-intensity change or noise. The output tensor is passed to a softmax function (16), where it is rescaled so that the elements are in the range [0, 1] and sum to unity. The rescaled elements represent the probabilities that the event under investigation represents noise or real activity, respectively.

$$\mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} \tag{16}$$
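As a small illustration of this final classification step, the sketch below applies a softmax to a two-element output and picks the more probable class. The ordering (index 0 for noise, index 1 for real activity) and the function names are assumptions made for the example.

```python
import math


def softmax(x):
    """Rescale a list of scores to probabilities in [0, 1] that sum to one."""
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def classify(logits, labels=("noise", "real")):
    """Return the predicted label and the class probabilities.

    The label order (noise first, real activity second) is an assumption.
    """
    probs = softmax(logits)
    return labels[probs.index(max(probs))], probs
```

For example, `classify([0.0, 2.0])` assigns the larger probability to the second class and therefore labels the event as real activity.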
Supervised learning is performed using the backpropagation algorithm to train the GNN-Transformer network. A PyTorch [47] implementation is used for constructing all the neural networks and performing training and testing. The training process is carried out to minimize the cross-entropy loss function using the Adam optimizer [48] with a learning rate of 0.001.

Ablation Study: To select the most suitable structures for both the GNN and the transformer, an automated search routine was developed. The search spanned several parameters, including the graph structure, the message operation, the aggregation functions, the number of EventConv layers in the GNN, the activation functions, and the number of encoder-decoder units in the transformer. These parameters determine the nonlinear capacity of the model and hence need to be carefully selected to suit the problem in question. It was observed that several architectures achieved comparable performance and were able to correctly classify the majority of real-activity and noise events.
Figure 7 reports the loss obtained on the training dataset by the highest-performing architectures among the tested neural networks. The loss curves are grouped by the adopted neural network architecture: GNN, GNN in conjunction with a transformer of a single encoder-decoder layer (GNN-Transformer 1E1D), GNN in conjunction with a transformer of a double encoder-decoder layer (GNN-Transformer 2E2D), and GNN in conjunction with a transformer of a triple encoder-decoder layer (GNN-Transformer 3E3D). For every architecture, the number of quantities composing the messages that characterize the spatiotemporal correlation within the graph was varied; four combinations of quantities in the message were tested.

The performance of all the attempted networks is evaluated using unseen testing datasets, which are composed of streams of events obtained experimentally. The performance evaluation metrics used to compare the training and validation results are the accuracy, signal ratio, noise ratio, and signal to noise ratio, computed for each event with respect to the ground truth labels obtained using our proposed KoGTL.
Training and testing results show that the GNN-Transformer architecture with 7Qs-MSG in the EventConv layer, as described in Section III-B2, and a transformer with a double encoder-decoder layer achieved the best performance among all candidate neural classifiers in terms of noise filtration accuracy, as reported in Table IV in the supplementary material. The proposed GNN-Transformer architecture is depicted in Fig. 6-IV.
It is worth noting that the quantities included in the messages play a pivotal role in reflecting the spatiotemporal correlation of the event and its neighboring events, and thus in the overall performance of the filter, as clearly shown in the loss curves of the GNN-Transformer 3E3D. More specifically, although the architecture of the neural network was complex enough, the number of quantities in the message drastically affected the filter's performance.

IV. EXPERIMENTAL EVALUATIONS
The proposed GNN-Transformer algorithm for event denoising is tested qualitatively and quantitatively in multiple scenarios to demonstrate its validity, effectiveness, and generalization. The training process, including training and testing data preparation, is described in Section IV-A. In Section IV-B, the evaluation metrics used to quantify the results are presented. Section IV-C presents the quantitative performance analyses of the developed GNN-Transformer model. Moreover, the GNN-Transformer model is benchmarked against other existing event denoising methods, where the developed model's capability, effectiveness, and validity are discussed. In addition, the performance of the model is evaluated qualitatively on part of the datasets that we have recorded but have not exposed to the network during training, as well as on several publicly available datasets, as presented in Section IV-D. This is to demonstrate the model's generality and robustness to various illumination conditions and unseen data.

A. Training and Testing Datasets
Training and testing datasets are constructed from experiments recorded in our lab as well as other publicly available datasets. Training is exclusively done using our recorded dataset because of the availability of ground truth labels to support supervised learning. Testing, on the other hand, is done on both recorded and publicly available datasets, where quantitative and qualitative evaluations are done.
Recorded experiments were conducted following the approach described in Section III-A using iniVation's DAVIS346C dynamic vision sensor [42]. Four lighting conditions were used to record experiments: very good lighting (∼750lux), office lighting (∼300lux), low light condition (∼5lux), and moonlight condition (∼0.15lux). Every experimental scenario includes scenes recorded when the camera is static or is starting translational motion, and scenes recorded when the camera is moving in four different directions. In the former case, static noise pixels can be detected and learned accordingly. The latter cases exhibit the dynamic nonlinear nature of event and noise generation as well as the spatiotemporal correlations of an event and its neighborhood when the camera is in motion.
Samples from the experiments recorded under very good lighting (∼750lux) and low light condition (∼5lux) were used for quantitative analysis (training and testing). Each sample consists of a newly arrived event and its corresponding neighboring events within the defined spatial and temporal window. More specifically, for each scenario, a total of 8000 samples were randomly selected from each of the five scenes (static and motion in four directions): 4000 real-activity events and 4000 noise samples. This ensures that the training dataset is balanced and is not biased towards one class more than the other. Hence, a total of 80k samples constitute the dataset, 80% of which are used for training and 20% for testing.
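The sampling arithmetic above (2 lighting scenarios × 5 scenes × 8000 samples = 80,000 samples, split 80/20) can be illustrated with a short sketch. The function name `balanced_split`, the fixed seed, and the choice to draw and split one pool at a time are assumptions for illustration only; the text does not specify how the random split was implemented.

```python
import random


def balanced_split(real, noise, per_class=4000, train_frac=0.8, seed=0):
    """Draw an equal number of real and noise samples, then split them.

    real, noise: lists of candidate samples for one scene.
    Returns (train, test), each a list of (sample, label) pairs with
    label 1 for real activity and 0 for noise.
    """
    rng = random.Random(seed)
    samples = [(s, 1) for s in rng.sample(real, per_class)]
    samples += [(s, 0) for s in rng.sample(noise, per_class)]
    rng.shuffle(samples)
    cut = int(round(len(samples) * train_frac))
    return samples[:cut], samples[cut:]


# Dataset size implied by the text: 2 scenarios x 5 scenes x 8000 samples.
TOTAL_SAMPLES = 2 * 5 * 8000
```

With these numbers, an 80/20 split yields 64,000 training samples and 16,000 testing samples.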
Moreover, qualitative analysis of the model's performance on two recorded experiments (∼300lux and ∼0.15lux) and eleven publicly available datasets was carried out. The publicly available datasets [12] include indoor and outdoor scenarios and were recorded under numerous illumination conditions and with different motion dynamics, as summarized in Table I.

Prior to training the model, every sample event and its corresponding neighborhood are used to construct a graph, which is used as the input to the graph neural network. The size of the neighborhood, i.e. the local volume, is selected to be a maximum of 10 nodes (or events) within a 5 by 5 pixel window centered at the event of interest in the preceding 50 ms. If more events were acquired in this volume, only the latest 10 are included in the graph. The volume size was selected after several experiments with varying volume parameters. It was observed that 10 neighboring events in the local volume are sufficient to delineate the spatiotemporal correlations and hence to decide whether the event of interest is real or noise.

To expedite training and convergence, it is common practice to normalize all the inputs to the neural network to a common range. In this work, all inputs are rescaled to the range [0.05, 0.95], excluding values very close to 0 and 1 to avoid neuron saturation, which causes the problem of vanishing gradients. For example, the sigmoid function saturates towards its limiting values of 0 and 1, where its derivative approaches zero and causes gradients to vanish.
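The neighborhood selection and input rescaling described above can be sketched as follows. This is a minimal illustration under stated assumptions: events are `(x, y, t)` tuples with timestamps in microseconds, `history` contains only earlier events, and min-max scaling is used for the rescaling (the text gives the target range but not the exact scaling rule). The function names are hypothetical.

```python
def local_neighbors(event, history, half_window=2, dt_us=50_000, max_nodes=10):
    """Select up to `max_nodes` most recent events in the local volume.

    The local volume is a (2 * half_window + 1)^2 pixel window centered at
    the event (5 by 5 for half_window=2) over the preceding `dt_us`.
    """
    x, y, t = event
    near = [(ex, ey, et) for ex, ey, et in history
            if abs(ex - x) <= half_window
            and abs(ey - y) <= half_window
            and 0 <= t - et <= dt_us]
    near.sort(key=lambda e: e[2])  # oldest first
    return near[-max_nodes:]       # keep only the latest events


def rescale(v, v_min, v_max, lo=0.05, hi=0.95):
    """Min-max rescale a value to [lo, hi] (assumed scaling rule)."""
    if v_max == v_min:
        return (lo + hi) / 2
    return lo + (hi - lo) * (v - v_min) / (v_max - v_min)
```

Keeping only the latest events bounds the graph size, while graphs with fewer than ten neighbors are simply smaller.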

B. Evaluation Metrics
To quantitatively evaluate the performance of the proposed denoising model and compare to state-of-the-art models on training and testing datasets, four evaluation metrics are used: Accuracy, Signal Ratio (SR), Noise Ratio (NR), and Signal to Noise Ratio (SNR).
a) Accuracy: This metric measures the model's ability to correctly predict real-activity events and noise, as defined in (17).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{17}$$

where TP, FP, TN, and FN are the numbers of true positive, false positive, true negative, and false negative events, respectively. TP indicates the number of events that are correctly predicted as real-activity events, whereas TN indicates the number of events that are correctly predicted as noise.
b) Signal Ratio (SR): This metric represents the proportion of correctly predicted real-activity events with respect to the total number of real-activity events in the scene, which is also known as recall (true positive rate), as defined in (18).

$$SR = \frac{TP}{TP + FN} \tag{18}$$
c) Noise Ratio (NR): This metric represents the proportion of noise events that are incorrectly predicted as real activity with respect to the total number of noise events in the scene, which is also known as the false positive rate, as defined in (19).

$$NR = \frac{FP}{FP + TN} \tag{19}$$
d) Signal to Noise Ratio (SNR): This metric is the ratio of the number of correctly predicted real-activity events to the number of noise events incorrectly labeled as real-activity events, as described in (20).

$$SNR = \frac{TP}{FP} \tag{20}$$
The performance of the denoising model is considered better with higher SR and SNR values and lower NR values.
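The four metrics can be computed directly from the confusion counts. The sketch below follows the verbal definitions given above, with real activity as the positive class; the function name and the handling of empty denominators are illustrative choices, not part of the paper.

```python
def denoising_metrics(tp, fp, tn, fn):
    """Compute accuracy, SR, NR, and SNR from confusion counts.

    tp: real-activity events kept       fn: real-activity events removed
    fp: noise events kept               tn: noise events removed
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "SR": tp / (tp + fn) if (tp + fn) else float("nan"),
        "NR": fp / (fp + tn) if (fp + tn) else float("nan"),
        # With no noise passed, the ratio is unbounded.
        "SNR": tp / fp if fp else float("inf"),
    }
```

As a usage example with made-up counts, a filter that passes almost everything on a balanced set of 200 events (`tp=99, fn=1, fp=92, tn=8`) obtains a high SR together with a high NR but an accuracy close to 50%, which is why the metrics need to be read jointly.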

1) Evaluation on Training and Testing Datasets:
In this section, the performance of the proposed GNN-Transformer based event denoising model is compared against state-of-the-art denoising methods, namely EDnCNN [20], Yang filter [21], Khodamoradi filter [22], Liu filters [23], and the Nearest Neighbor (NNb) filter [24]. All filters are tested on the same dataset, which was used to train our proposed approach. The dataset was randomly split into training and testing subsets, where 80% of the samples were used for training and 20% were used for testing (not exposed to the network during training).

EDnCNN filter's parameters were set to those of the published trained model, which consists of 3 × 3 convolutional layers followed by two fully connected layers. To filter an event, a spatiotemporal window of 25 × 25 × 5s centered at that event pixel is considered to construct the input feature to the model. More specifically, a 25×25×k×2 matrix is populated with the k most recent positive and negative events that were received prior to the event of interest, where k was set to 2. The pre-trained EDnCNN model parameters [20] were used to perform accuracy evaluations on both our training and testing datasets.

Yang filter's parameters were set to the default values reported in [21]. More specifically, the time window was set to 5ms, the spatial window to 5 by 5 pixels, and the density to 3. As for the Khodamoradi filter, the time window was set to 1ms, as in [22] and [21]. Two down-sampling factors S of Liu's filter were used (S = 1, 2), where the timestamps of 2×2 and 4×4 pixels were stored in one memory cell, and the time window was set to 1ms, as tested in [21]. The working principle of the Liu and Khodamoradi filters was previously described in Section II, Fig. 3b and Fig. 3c, respectively. Lastly, for the NNb filter, the size of the event's local volume is set to 3 by 3 pixels for 1ms, as reported in [24]. The performance of these denoising methods was compared to that of the proposed GNN-Transformer approach as presented next.
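To make the simplest of these baselines concrete, the sketch below implements a generic nearest-neighbor (background-activity) filter with the 3 by 3 pixel, 1ms setting mentioned above. It is a common textbook form of such a filter and only an approximation of [24], whose exact rules (for instance, whether the event's own pixel counts as support) may differ. Timestamps in microseconds and the function name are assumptions.

```python
def nn_filter(events, width, height, dt_us=1000, radius=1):
    """Keep an event only if a neighboring pixel fired within `dt_us`.

    events: iterable of (x, y, t) tuples in time order, t in microseconds.
    radius=1 gives a 3 by 3 neighborhood; the event's own pixel is ignored.
    """
    last = [[None] * width for _ in range(height)]  # last timestamp per pixel
    kept = []
    for x, y, t in events:
        supported = False
        for ny in range(max(0, y - radius), min(height, y + radius + 1)):
            for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                if (nx, ny) == (x, y):
                    continue
                ts = last[ny][nx]
                if ts is not None and 0 <= t - ts <= dt_us:
                    supported = True
        if supported:
            kept.append((x, y, t))
        last[y][x] = t  # record the event whether or not it was kept
    return kept
```

Such a filter uses a single fixed threshold on spatiotemporal support, in contrast to the learned classifier, which weighs several quantities of the local graph.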
Table II reports the filtration accuracy achieved by the GNN-Transformer network, EDnCNN filter, Yang filter, Khodamoradi filter, Liu filter, and NNb filter when evaluated on the training and testing datasets. It is worth mentioning that the training and testing datasets have equal numbers of real and noise events (50% real events and 50% noise events). It is observed that the GNN-Transformer outperforms all the other alternatives in terms of filtration accuracy. The proposed model has outperformed EDnCNN by 10.6% on the training dataset and 8.4% on the testing dataset. It has also achieved 12% higher training and testing accuracy compared to Yang filter, which in turn showed the best filtration accuracy among the conventional filters (Khodamoradi, Liu, and NNb filters).
A high SNR value does not necessarily mean that a filter's performance is better than others. Rather, a high SNR value, a high SR value, and a low NR value together would indicate a good filtering performance. A clear example is the Khodamoradi filter, which achieved the highest SR (99%) and the highest NR (92%) values among other filters. These values mean that nearly all input data have been considered real activity and almost no noise filtration took place. In other words, the filter could not distinguish between the incoming real-activity events and the accompanying noise.
Another example is Liu's filter, which achieved the lowest NR (1-2%) and a relatively low SR (10-30%). In this case, most of the input data have been considered as noise. This implies the weak denoising capability of Liu's filter. Meaningful real-activity events have been filtered out and consequently scene perception algorithms would fail to operate as expected.
To conclude, the best event denoising model is expected to have a high accuracy, SR, and SNR, and a low NR. Thus, our proposed GNN-Transformer has clearly outperformed all alternative filters and demonstrated its capability to generalize to unseen datasets. Table II compares the number of correctly and incorrectly predicted real-activity events from the training and testing datasets.

2) Evaluation on our Recorded Dataset - Continuous Stream of Events:
In this section, the proposed model is tested online on a continuous stream of events and then compared to state-of-the-art denoising techniques. In other words, instead of randomly selecting samples from the recorded experiments, the full stream of events generated by the DVS is passed through each filter, which is then evaluated against our labeled dataset.
Filtering techniques were tested in two scenarios: the experiments recorded at ∼750lux and ∼5lux. In the first scenario, filtering was done over 600ms, where SR and NR were evaluated every 10ms as shown in Fig. 8a. The second scenario was run for 170ms and evaluation was done at 5ms intervals as shown in Fig. 8b. Evaluations of SR, NR, and SNR over the full period of time for both scenarios are depicted in Fig. 9a and Fig. 9b. The total number of events included in this test is 7M and 0.1M for the first and second scenarios, respectively.
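The interval-based evaluation described above can be sketched as a simple binning of labeled filter decisions. The input format, a list of `(t_us, is_real, kept)` tuples with timestamps in microseconds, and the function name are assumptions made for this illustration.

```python
def windowed_sr_nr(events, window_us=10_000):
    """Compute SR and NR per time window over a labeled, filtered stream.

    events: iterable of (t_us, is_real, kept), where `is_real` is the ground
            truth label and `kept` is True if the filter passed the event.
    Returns a list of (window_start_us, SR, NR); a ratio is None when the
    window contains no events of the corresponding class.
    """
    bins = {}
    for t, is_real, kept in events:
        counts = bins.setdefault(t // window_us, [0, 0, 0, 0])  # tp, fn, fp, tn
        if is_real:
            counts[0 if kept else 1] += 1
        else:
            counts[2 if kept else 3] += 1
    result = []
    for idx in sorted(bins):
        tp, fn, fp, tn = bins[idx]
        sr = tp / (tp + fn) if (tp + fn) else None
        nr = fp / (fp + tn) if (fp + tn) else None
        result.append((idx * window_us, sr, nr))
    return result
```

Setting `window_us=5_000` corresponds to the 5ms evaluation interval used in the second scenario.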
It is evident, through the conducted tests, that our proposed GNN-Transformer based event denoising technique has achieved the best filtering performance compared to all the other filters. This demonstrates the effectiveness of the proposed event denoising approach and shows robustness to different camera motion dynamics under illumination variations. According to our evaluations, the second-best learning-based event-denoising technique is the EDnCNN [20] filter and the best conventional event-denoising filter is Yang filter [21]. Thus, further qualitative performance assessments of our proposed approach are conducted against those two filters only, as presented in Section IV-D.
3) Computational Time Complexity and Memory Analysis: In this section, time and memory analyses of the proposed approach are discussed and compared to the EDnCNN filter, since both are based on neural networks. A set of 10,000 event samples was selected from the stairs dataset presented in [20] to conduct the timing analysis.
The computational time analysis of the proposed algorithm was carried out on an ASUS laptop with an Intel Core i7-7700HQ @ 2.80GHz × 4 and an NVIDIA GeForce GTX 1050 Ti 4GB. The analysis was done with and without GPU support in two modes: sequential mode, in which events were passed to the filter successively, one after the other, and batch mode, in which all events were passed to the filter as a single batch. The time needed to filter the events in each mode was recorded for both filters as listed in Table III. In all cases, the time needed to complete the filtration was shorter using our proposed approach compared to EDnCNN. In particular, our approach achieved a speedup of up to two orders of magnitude in batch mode compared to the other filter when run on CPU, and a speedup of up to one order of magnitude when run on GPU. This speed-up is significant because operation in batch mode is necessary given the high temporal resolution of the event camera and its working principle, which allows 346 × 260 pixels to be active simultaneously. In other words, the proposed approach is capable of handling batches of events concurrently in a very short period of time, and hence preserves the high temporal resolution of the sensor. It is also worth noting that the proposed approach exhibited the fastest performance when processing events in batch mode on a CPU, which obviates the need for sophisticated hardware to achieve fast and accurate noise filtration. This makes the proposed approach suitable for limited computational power and resource-constrained platforms such as high speed UAV control [49], UAV navigation [50], and space applications [51].
To project this analysis onto a real-world scenario, consider the application of autonomous car driving, where neuromorphic vision could be employed to observe the environment during navigation. As the speed of the vehicle increases, the number of generated events increases accordingly, resulting in a very large number of events to process. Faster processing of visual observations will thus result in a faster response to changes in the vehicle's surroundings. This is expected to reduce the probability of collisions and to enhance the effectiveness of the overall system.
The overall memory requirement per event classification is 5 × 5 × N_g, where N_g is the number of events per graph and ranges from 1 to 10 events, whereas in EDnCNN the size of the input feature is 25 × 25 × 2 × 2. This shows that our approach is more memory efficient than EDnCNN: even when the graph in our approach has 10 nodes (the maximum number of nodes per graph), the memory requirements are 10 times less than those of EDnCNN.
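The factor of ten follows from a direct count of stored values. As a worked check, with $M_{\text{ours}}$ and $M_{\text{EDnCNN}}$ introduced here only as shorthand for the two input sizes (counted in elements, not bytes):

$$
\begin{aligned}
M_{\text{ours}} &= 5 \times 5 \times N_g \le 5 \times 5 \times 10 = 250,\\
M_{\text{EDnCNN}} &= 25 \times 25 \times 2 \times 2 = 2500,\\
\frac{M_{\text{EDnCNN}}}{M_{\text{ours}}} &\ge \frac{2500}{250} = 10 .
\end{aligned}
$$

For graphs with fewer than ten nodes the ratio is larger; for example, $N_g = 5$ gives $2500 / 125 = 20$.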

D. Qualitative Results
In this section, two experiments from our recorded dataset, particularly those recorded at ∼300lux and ∼0.15lux, are used to qualitatively analyze the denoising performance of the proposed model against the EDnCNN and Yang filters. Sample filtering results, superimposed on APS images for better visualization, are depicted in Fig. 10. The results show that our model has filtered out most of the background activity noise and maintained events representing relative motion of meaningful features in the scene, as in Fig. 10a. Although more scattered noise is present under low lighting conditions, as shown in Fig. 10b, our proposed model was able to preserve the events that represent meaningful features (edges) in the scene. Conversely, Yang filter has eliminated the majority of real-activity events from the scene, while leaving some scattered ones that could be hard to interpret as edges or meaningful features. This demonstrates the robustness of our model against illumination variations.
To further demonstrate the validity and generalization of our proposed model, we have extensively tested it and compared it against others using eleven publicly available datasets. These recorded data were acquired with different camera motion dynamics (type of motion and speed) and under different lighting conditions. Fig. 11 shows two examples of denoised events obtained using the proposed model, EDnCNN, and Yang filter. It was noticed that EDnCNN eliminated a large number of events that belong to meaningful features in the scene. For instance, the filtered event stream corresponding to the scene taken from the DrivingTunnelSun dataset shown in Fig. 11a lacks significant events that represent clear intensity variations as per the corresponding APS images. Such events were classified as noise by the EDnCNN filter. The same observation can be made in the scenes from the other datasets, such as DrivingCity4 in the same figure. Yang filter passes the majority of the events (both real and noise signals), thus making it more difficult to identify objects (edges) in the scene compared to our proposed model. Therefore, the GNN-Transformer based event denoising model generalizes well to new scenarios under various illumination conditions without any further tuning of its parameters. More results are demonstrated in the supplementary material (Appendix: Fig. 12), the additional results document at <https://github.com/Yusra-alkendi/ED-KoGTL>, and the video <https://youtu.be/ZM76UaxbuJE>, which visualize the denoising performance of the GNN-Transformer classifier compared to Yang filter [21] and EDnCNN [20].

V. CONCLUSION
In this work, we developed a novel algorithm to filter out the noise associated with event streams acquired by dynamic vision sensors. The GNN-Transformer based event-denoising algorithm exploits the spatiotemporal correlations between events in a particular neighborhood to decide whether an incoming event represents noise or a log-intensity variation in the observed scene. To train the proposed GNN-Transformer model, a novel offline event labeling technique, KoGTL, is proposed to distinguish between noise and real events in event streams recorded under challenging lighting conditions. The labeled DVS data is made available to the public research community for benchmarking purposes. The proposed algorithm successfully operates on event streams irrespective of camera parameters, illumination conditions, and motion dynamics. This is attributed to the fact that the adopted graph structure of the input data preserves the spatiotemporal correlation between the events, rather than solely the raw properties of the events. Such operation is carried out in the proposed EventConv layer. The proposed algorithm also operates on event graphs of variable sizes and thus handles the asynchronous nature of event streams.
Through extensive training and testing, the proposed algorithm has been shown to achieve high denoising performance under challenging illumination conditions. Our model is also tested on eleven publicly available datasets which were not exposed to the network during training. The model is able to successfully denoise the event streams, despite the fact that the data is recorded under conditions different from those of the training data, including different environmental conditions, various camera motions, and camera parameters. The quantitative results have demonstrated the denoising capability of the proposed algorithm with at least 8.8% higher filtration accuracy on testing sets compared to existing methods. Qualitatively, the results achieved by the proposed model have verified its effectiveness and generalization to previously unseen event graph data, irrespective of their sizes. This work has unveiled the power and potential of graph neural networks and transformers on event cameras.
In the future, we plan to demonstrate the significance of our proposed denoising approach by integrating it into other event-based computer vision algorithms such as motion segmentation, object detection, object tracking, and object recognition, under challenging lighting conditions. We also plan to exploit the potential of graph neural networks and transformers for other event-based vision algorithms. Another possible extension of the current work is to integrate the denoising module with vision algorithms and employ them for robot navigation, autonomous driving cars [52], and healthcare applications such as human fall detection [53]. Eliminating noise events from the observed scene in such scenarios is foreseen to improve the accuracy of the vision algorithms responsible for localizing obstacles and detecting human fall accidents. Noise events, if not eliminated, may be mistaken for real changes in the scene intensities, which could result in false positive detections. In the case of autonomous driving, falsely detecting an obstacle along the way will interrupt the vehicle's trajectory and may cause it to take longer paths and more time, which is undesirable. As for human fall detection, noise events may decrease the accuracy of localizing a human and estimating the temporal window of the accident by introducing erroneous information into the observation. To that end, integrating the proposed denoising method into such systems is envisioned to enhance their accuracy and effectiveness.

APPENDIX A
ADDITIONAL QUALITATIVE EVENT DENOISING RESULTS
Fig. 12 presents additional qualitative denoising results on other unseen published datasets of our proposed method compared to the state-of-the-art denoising models [21] and [20].

Fig. 4: Proposed event denoising framework. A GNN-Transformer based event denoising algorithm is developed and trained on event datasets, generated and labeled using the proposed Known-object Ground-Truth Labeling (KoGTL) approach. The proposed algorithm classifies incoming event streams into real activity events or noise.
[Figure panel labels: event data and the corresponding image frame of Canny edges, before and after ICP fitting; event classes: real activity event (Class 1) and noise event (Class 0).]

Fig. 6: Framework of our GNN-Transformer classifier for event denoising. Note: x and y are the pixel coordinates at which the event occurred. t is the event's timestamp. i and j are the source and destination nodes between which a message is transferred in Step1-(1) EventConv layer. Q1,L, ..., Q7,L are quantities that reflect spatiotemporal properties in the graph, where L represents the node index and m denotes the number of events in the local volume. h is the event graph signature. α is a learning parameter. σ is a sigmoid activation function.

Fig. 7: Ablation study results - loss curves obtained upon training various network architectures as part of the automated search for the best suited neural network architecture.

Fig. 8: Signal Ratio (SR), Noise Ratio (NR), and Signal to Noise Ratio (SNR) event denoising performances of the GNN-Transformer model and state-of-the-art denoising methods - using sample stream of events recorded (a) at ∼750lux (b) at ∼5lux.

Fig. 9: Signal Ratio (SR), Noise Ratio (NR), and Signal to Noise Ratio (SNR) event denoising performances of the GNN-Transformer model and state-of-the-art denoising methods - using sample stream of events recorded (a) at ∼750lux (b) at ∼5lux. The performance of the denoising model is considered better with higher SR and SNR values and lower NR values. It can be observed that the best performing denoising methods are ours and EDnCNN [20]. However, for a fair comparison, the metrics have to be analyzed collectively. It was observed that EDnCNN has considered a large number of events as noise, which decreased the NR value compared to ours. However, a significant number of these filtered events belong to meaningful features, i.e. were incorrectly labeled as noise, which resulted in a lower SR value than ours.

Fig. 11: Sample of denoising results tested on published datasets (unseen data), denoised events from DVS (yellow dots) overlaid on corresponding APS image.

TABLE I: Description of the publicly available datasets used from [12]

TABLE II: Performance of the GNN-Transformer classifier compared to state-of-the-art denoising methods on the training and testing datasets

TABLE III: Time in seconds to filter events using our proposed approach and EDnCNN method [20]. Note that µ and σ represent the mean and standard deviation, respectively.

TABLE IV: Performance comparison of the proposed event denoising classifier and its network variants on the training and testing datasets. Note that Case I, Case II, Case III, and Case IV denote GNN, GNN-Transformer 1E1D, GNN-Transformer 2E2D, and GNN-Transformer 3E3D, respectively.