CSTR: A Compact Spatio-Temporal Representation for Event-Based Vision

Event-based vision is a novel perception modality that offers several advantages, such as high dynamic range and robustness to motion blur. In order to process events in batches and utilize modern computer vision deep-learning architectures, an intermediate representation is required. Nevertheless, constructing an effective batch representation is non-trivial. In this paper, we propose a novel representation for event-based vision, called the compact spatio-temporal representation (CSTR). The CSTR encodes an event batch’s spatial, temporal, and polarity information in a 3-channel image-like format. It achieves this by calculating the mean of the events’ timestamps in combination with the event count at each spatial position in the frame. This representation shows robustness to motion-overlapping, high event density, and varying event-batch durations. Due to its compact 3-channel form, the CSTR is directly compatible with modern computer vision architectures, serving as an excellent choice for deploying event-based solutions. In addition, we complement the CSTR with an augmentation framework that introduces randomized training variations to the spatial, temporal, and polarity characteristics of event data. Experimentation over different object and action recognition datasets shows that the CSTR outperforms other representations of similar complexity under a consistent baseline. Further, the CSTR is made more robust and significantly benefits from the proposed augmentation framework, considerably addressing the sparseness in event-based datasets.


I. INTRODUCTION
Perception plays a crucial role in real-time robotic applications, enabling their operation in dynamic and unpredictable environments [1], [2]. These applications often operate under challenging lighting conditions, including high dynamic range (HDR) or high-speed motion scenes. Ensuring accurate perception and prompt responses under such conditions is vital for their success, especially in safety- or time-critical applications like autonomous vehicles [1] and industrial automation [2]. For instance, in an HDR scene such as when emerging from a tunnel in broad daylight, the failure to detect objects like vehicles or traffic signs can have severe consequences [3]. To address the challenges of robust operation in challenging lighting conditions (e.g., HDR or high-speed motion scenes) and in potentially dynamic and unpredictable environments, many researchers have increasingly turned to event-based vision [4], [5] as a promising alternative visual sensing modality.
The associate editor coordinating the review of this manuscript and approving it for publication was Claudio Loconsole.
Event-based sensors, such as the Dynamic Vision Sensor (DVS) [6] or the Asynchronous Time-Based Image Sensor (ATIS) [7], operate by capturing per-pixel brightness changes asynchronously and at very high temporal resolutions [6], [7]. This results in a spatially sparse yet temporally dense output that effectively represents all visual changes in a scene over a specified time interval. In contrast, traditional cameras capture intensity images at a fixed rate, such as 24 frames per second [8]. This fixed rate can possibly lead to oversampling of static scenes, resulting in redundant data; or undersampling of scenes with high-speed motion, resulting in motion blur [5]. Overall, event-based vision offers several distinct properties that address dynamic range, response time, and motion blur issues. These properties include an HDR of >120 dB, microsecond-level temporal resolution, low output latency in the order of microseconds, and low power consumption averaging a few milliwatts [5], [6]. Consequently, these characteristics make event-based vision particularly well-suited for real-time robotic applications [9], [10]. Such applications require accurate perception and prompt response to visual changes, especially in challenging scenarios such as HDR scenes [11], low-light conditions [12], or high-speed motion environments [13]. In comparison, traditional cameras often struggle to perform effectively in such scenarios [3], [10].
FIGURE 1. Overview of the general framework of this paper. Sparse and asynchronous events, representing brightness changes at each pixel, are captured using an event-based sensor. To utilize this spatio-temporal event data, an intermediate representation is required to leverage modern deep-learning solutions when processing events in batches. In this work, we propose the Compact Spatio-Temporal Representation (CSTR) that encodes spatial, temporal, and polarity information of event data in a 3-channel image-like format. Accordingly, the CSTR is directly compatible with off-the-shelf pre-trained computer vision architectures.
While the properties of event-based vision are very compelling, effectively utilizing event data in various applications presents a challenge. The generated event stream is asynchronous and sparse, necessitating its transformation into a compatible format for established algorithmic methodologies. For instance, most traditional object detectors and classifiers employ a three-channel input designed for RGB imagery [14], [15]. However, the independence and sparsity of events make it non-trivial to establish batch relationships, often leading to the creation of hand-crafted representations tailored to specific applications [16], [17]. This inherent problem hampers generalization, as traditional frame-based cameras benefit from standardized formats that facilitate the canonical transfer learning of dataset weights across tasks. In contrast, event-based algorithms are highly sensitive to the specific type of open-source data and its representation. This further exacerbates the data sparsity issue. As a result, the data needs to be closely associated with the particular task at hand, adversely impacting generalization and posing challenges for training convergence.
Accordingly, most works resort to using image-like representations in order to leverage pre-trained computer vision models. One common representation is the Event Frame [18], [19], chosen for its simplicity. This representation keeps track of whether any event has occurred at each pixel within a given time period (where the time period is a variable that can be adjusted per task). By doing so, the batch of events is effectively transformed into a single-channel image (or can be replicated to form a 3-channel image) that can be utilized with existing algorithms. While convenient, this approach has some limitations. Notably, it binarizes the behavior for the specified sampling period, losing temporal and polarity information (brightness changes), and is generally outperformed by more sophisticated approaches [20], [21], [22], [23]. Alternatively, more advanced representations have been explored to capture temporal and polarity contexts [20], [21], [22], [23]. These representations demonstrate better performance, but they come with either the trade-off of notable pre-processing overhead [20], [21] or are not directly compatible with pre-trained computer vision architectures that require a 3-channel input [22], [23].
To address these challenges, we propose a novel representation for event data called the Compact Spatio-Temporal Representation (CSTR). The CSTR efficiently encodes the spatial, temporal, polarity, and event count information of a given event batch while requiring minimal processing overhead. This is achieved by calculating the mean timestamps of the events per polarity type (positive or negative) and the normalized event counts at every spatial position in the resulting representation frame. This results in a 3-channel image-like format that is directly compatible with existing state-of-the-art networks [14], [15], allowing for seamless integration without the need for additional modifications. We visualize the general framework of this paper in Fig. 1.
We demonstrate the effectiveness of the CSTR through a comprehensive series of well-established event-based recognition benchmarks. This benchmarking includes six well-known representations that are similarly compatible with off-the-shelf networks over the following datasets: N-MNIST [24], N-CARS [25], N-Caltech101 [24], CIFAR10-DVS [26], ASL-DVS [27], and DVS-Gesture [28]. The CSTR is consistently an excellent performer, achieving the highest overall classification accuracy. Furthermore, the CSTR is stable when applying random augmentations; these are demonstrated to notably enhance classification accuracy, validating that the CSTR is a robust approach for encoding event data.

II. RELATED WORK
Event-based vision has recently seen significant advancements that leverage its unique characteristics for various applications [5], [10], [12], [13], [23]. There are two general approaches to effectively utilize the asynchronous and sparse event data. These include event-by-event and batch processing. In this section, we provide an overview of the relevant methods of each approach, highlighting their strengths and identifying their limitations. Next, we provide an overview of augmentation methods explored in the literature for enhancing event data. Finally, we introduce the proposed CSTR along with a new augmentation in the context of these limitations, noting how they address some of the remaining challenges.

A. EVENT-BY-EVENT PROCESSING
Event-by-event processing methods directly utilize events as they are received [29], [30], [31], [32]. This approach is intuitive and minimizes processing delays. The most prominent methods are spiking neural networks (SNNs) [32], [33], [34], [35], [36]. An SNN is a bio-inspired version of artificial neural networks comprising interconnected neurons. SNNs operate by integrating incoming spikes (events at the input layer) over time. An output spike is generated when the membrane potential of a neuron surpasses a certain threshold, causing it to reset. The generated output spikes propagate information to other neurons in deeper layers, connected hierarchically. This neuron-activation threshold enables SNNs to be computationally efficient [35], [36], [37].
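The integrate-and-reset behavior described above can be illustrated with a minimal sketch of a single leaky integrate-and-fire neuron. This is not the neuron model of any cited SNN; the function name `lif_neuron` and the `weight`, `leak`, and `threshold` values are assumptions chosen purely for illustration.

```python
def lif_neuron(input_spikes, weight=0.6, leak=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    The leak factor and input weight are illustrative assumptions; the text
    only specifies integration, thresholding, and reset.
    """
    potential = 0.0
    output_spikes = []
    for spike in input_spikes:
        # Integrate the incoming spike on top of the decayed potential.
        potential = leak * potential + weight * spike
        if potential >= threshold:
            # Emit an output spike and reset the membrane potential.
            output_spikes.append(1)
            potential = 0.0
        else:
            output_spikes.append(0)
    return output_spikes


# Two input spikes in quick succession are needed to trigger an output spike.
out = lif_neuron([1, 1, 0, 0, 1, 1, 1])
```

Because the neuron stays silent until enough input accumulates, downstream computation is only triggered by output spikes, which is the source of the efficiency noted above.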
Despite the computational efficiency and minimal latency of event-by-event algorithms, they suffer from some limitations. Processing events individually inherently lacks temporal context, necessitating tailored solutions to compensate for the lack of event history [29], [30], [31]. Ironically, this approach can become computationally expensive during periods of high event density. Scenes with significant motion and texture can generate a substantial number of events per second, requiring a proportional number of operations. As event-based sensors continue to improve their frame resolutions [8], [38], this computational challenge will only intensify. While SNNs somewhat address the latter with their energy-efficient design, they are non-trivial to set up and implement [32], [33], [34]. Moreover, SNNs require specialized hardware, which limits their widespread adoption, posing additional barriers to deployment.

B. BATCH PROCESSING
Batch processing methods accumulate, encode, and classify the events generated in a given time period. These approaches add temporal context with the capability to provide synchronous responses (i.e., one classification per batch period). By applying an intermediate encoding method, they have the key benefit of being able to employ modern computer-vision networks. This is directly germane to the problem statement of being able to leverage existing state-of-the-art networks (and corresponding training weights). Hence, we focus this survey on event-batch representations that are compatible with frame-based networks.

1) IMAGE-LIKE REPRESENTATIONS
Many opt to represent event batches in a simple image-like format. These representations encode spatial, temporal, and/or polarity information into traditional one, two, or three-channel images. Such approaches are popular because they enable rapid prototyping and demonstrate strong performance across various perception tasks [18], [19], [39]. For example, the Event Frame encodes the event's spatial information (i.e., the existence of any events per spatial position) [18], while the Event Count (also known as Event Histograms) [39], [40], [41] indicates the number of events recorded instead. More advanced versions of these representations incorporate polarity information as well [19], [39], [41]. These representations, however, are inherently limited as they do not capture the temporal information of the event data. To address this limitation, more comprehensive representations have been developed to incorporate spatio-temporal information in an image-like format. One popular representation is Timestamp Images [42], also referred to as Time Surfaces [17]. Timestamp Images encode the timestamp information of the latest event at each spatial index [42], often represented using a separate channel per polarity type, resulting in a 2-channel representation [42]. Recent advancements related to Timestamp Images have explored sophisticated techniques to enhance robustness against noise [25], [43]. For instance, DiST [43] incorporates temporal discounting by considering the ρ spatio-temporally neighboring events at each spatial position, thereby discounting the timestamps of the latest events using a normalized time range of the neighboring pixels.
One challenge encountered in temporal representations is motion overwriting. While timestamp images excel in retaining contour information, the recent timestamps can be overwritten. This can happen when using long batch periods or in highly textured scenes. Accordingly, various representations have emerged that incorporate both the temporal and count information of events in different forms [44], [45], [46], [47]. For instance, a 4-channel representation, known as Event Image [45], [46], incorporates recent timestamps and event count per polarity. Another work by Bai et al. [47] proposes a more compact 3D representation that includes the temporal information of both polarities as well as the event count in separate channels. This forms a spatio-temporal image-like representation that encompasses vital information about the event data. The authors also investigate the advantages of this approach in the context of event-based object recognition.
Overall, the limitation of most spatio-temporal image-like representations can be distilled to overlapping events. A high number of overlapping events often results when using long batch periods or when operating in highly textured scenes. This can result in the overwriting of recent events, causing a loss of information. Shortening the batch period can potentially limit this issue [45]; however, this reduces temporal context and increases processing frequency.
As an alternative, image reconstruction from events is an effective approach that results in intensity images that enable the direct use of modern frame-based computer vision architectures [48]. However, generating images from events is very processing-heavy, making it poorly suited to real-time systems.

2) ADVANCED 4D GRID-LIKE REPRESENTATIONS
Advanced grid-like representations have been proposed to overcome the issue of event overlapping, thus retaining more information [22], [23]. For example, TORE volumes [23] utilize a first-in-first-out buffer at each spatial position to retain the temporal information of the last K events, for both polarity types, where K > 1. This results in a 4D representation with a resolution of 2 × K × H × W, where H and W are the frame's height and width, respectively. By doing so, TORE volumes [23] limit the problem of event-overwriting which is often encountered in image-like representations.
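The per-pixel first-in-first-out idea can be sketched as follows. This is a simplified illustration of keeping the last K timestamps per pixel and polarity, not the TORE implementation from [23]; the function name, the use of `-inf` for empty slots, and the channel order (index 0 for positive, 1 for negative) are assumptions.

```python
import numpy as np


def last_k_timestamps(x, y, t, p, height, width, k=3):
    """Keep the K most recent timestamps per pixel and polarity.

    Returns an array of shape (2, K, H, W). Events are assumed to arrive in
    chronological order; slot 0 holds the newest timestamp. Empty slots are
    marked with -inf (an illustrative choice).
    """
    buf = np.full((2, k, height, width), -np.inf)
    for xi, yi, ti, pi in zip(x, y, t, p):
        c = 0 if pi > 0 else 1  # assumed channel order: positive, negative
        # Shift older timestamps one slot back; the oldest falls off the end.
        buf[c, 1:, yi, xi] = buf[c, :-1, yi, xi].copy()
        buf[c, 0, yi, xi] = ti
    return buf


# Four positive events at the same pixel: only the last three are retained.
buf = last_k_timestamps([0, 0, 0, 0], [0, 0, 0, 0], [1.0, 2.0, 3.0, 4.0],
                        [1, 1, 1, 1], height=1, width=1, k=3)
```

The memory footprint grows linearly with K, which is the cost of retaining more history than a single-timestamp image.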
Another notable representation is Event Spike Tensors (EST) [22]. EST employs an end-to-end learning approach to derive event representations from input data. This is achieved by applying convolutional operations on a batch of events with a learned kernel comprising a multi-layer perceptron with two hidden layers. Then, the resulting convolutions are discretized, yielding a 4D grid-like representation with dimensions of 2 × B × H × W, where B is the pre-selected number of temporal bins.
Although these representations demonstrate remarkable performance in a multitude of tasks [22], [23], it is important to note that the choice of compatible deep learning architectures is somewhat limited. Consequently, an additional quantization step is often required to convert the 4D representation into a 3D format [22]. An alternative approach involves splitting the 4D grid along the polarity dimension (first dimension) and employing multiple deep learning models in parallel to process the resulting outputs, or modifying the input layers of a deep learning model to accommodate the higher-dimensional input. However, both approaches may lead to higher memory and computational requirements due to the increased dimensionality of the inputs.

3) VOXEL GRIDS
Voxel grids offer a precise means of capturing the spatial and temporal characteristics of events. A voxel represents a 3D point, traditionally denoting the height, width, and depth coordinates in a 3D model. Combining these voxels creates a 3D structure known as a voxel grid. Voxel grids are widely used in 3D computer vision, especially for representing a LiDAR-generated point cloud [49]. Similarly, they can also be used to handle sparse event data. Voxel grids are applied to event batches by converting the depth axis to a temporal axis using B temporal bins per event batch. This conversion is typically achieved through spatio-temporal quantization employing a designed sampling kernel. The resulting voxel grid has dimensions of B × H × W, allowing it to retain the essential spatio-temporal relationships within the event batches [16], [21], [50]. Accordingly, researchers have explored the application of voxel grids in various computer vision tasks, including optical flow estimation [16], [21], HDR video reconstruction [50], and object recognition [51].
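As a rough sketch of the temporal binning step, the snippet below assigns each event to one of B bins and accumulates its polarity. It assumes the simplest possible sampling kernel (each event falls entirely into one bin); the cited works use more elaborate kernels, and the function and parameter names here are hypothetical.

```python
import numpy as np


def voxel_grid(x, y, t, p, height, width, num_bins, t_start, period):
    """Accumulate event polarities into a (B, H, W) grid.

    Assumes a nearest-bin (floor) kernel, which is an illustrative
    simplification of the sampling kernels used in the literature.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    # Normalize timestamps to [0, 1] over the batch period.
    t_norm = (np.asarray(t, dtype=np.float64) - t_start) / period
    # Map each event to a temporal bin; clip so t_norm == 1 lands in the last bin.
    bins = np.clip((t_norm * num_bins).astype(np.int64), 0, num_bins - 1)
    # np.add.at accumulates correctly when several events share a voxel.
    np.add.at(grid, (bins, np.asarray(y), np.asarray(x)),
              np.asarray(p, dtype=np.float32))
    return grid


grid = voxel_grid(x=[0, 0, 1], y=[0, 0, 1], t=[0, 10, 90], p=[1, -1, 1],
                  height=2, width=2, num_bins=2, t_start=0, period=100)
```

The output grows with B, which illustrates the memory cost discussed in the next paragraph.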
Despite their advantages, the use of voxel grids poses two primary challenges. Firstly, generating voxel grids can be computationally demanding, especially when utilizing sophisticated sampling kernels. Secondly, the adoption of voxel grids may lead to high memory requirements due to the resulting increased input dimensionality, similar to the challenges with 4D representations discussed earlier. This issue becomes particularly prominent with high-resolution grids (i.e., a large number of bins B) and long batch periods.

4) GRAPH-BASED REPRESENTATIONS
As an alternative to voxel grids, events can be represented as graphs [20], [27], [52]. Here, each sampled event in an event batch is treated as a vertex v_i. These vertices v (also referred to as nodes) are then connected to each other using edges ε, based on a pre-defined spatio-temporal distance metric, forming the graph G. This approach similarly captures the temporal relationships within the event batch and offers compatibility with existing graph-convolutional networks (GCNs) [20], [27]. Graph-based solutions provide flexibility in the processing of the event data, allowing for a natural way to incorporate their spatial and temporal information [20], [27], [52]. Compared to traditional CNNs, GCNs exhibit significantly lower inference computational complexity [52].
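A minimal sketch of this construction is given below, assuming a Euclidean distance in (x, y, scaled t) space and a fixed connection radius. The metric, the `time_scale` factor, and the function name are illustrative assumptions rather than the choices made in [20], [27], or [52].

```python
import numpy as np


def build_event_graph(x, y, t, radius, time_scale=1.0):
    """Connect events whose spatio-temporal distance is within `radius`.

    Returns the vertex coordinates (n, 3) and a directed edge list (2, m).
    The dense pairwise distance matrix costs O(n^2) memory, so this sketch
    is only practical for a small sampled subset of events.
    """
    pts = np.stack([np.asarray(x, dtype=np.float64),
                    np.asarray(y, dtype=np.float64),
                    time_scale * np.asarray(t, dtype=np.float64)], axis=1)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep pairs within the radius, excluding self-loops.
    mask = (dist <= radius) & ~np.eye(len(pts), dtype=bool)
    src, dst = np.nonzero(mask)
    return pts, np.stack([src, dst], axis=0)


# Two nearby events are linked in both directions; the distant one is isolated.
vertices, edges = build_event_graph([0, 1, 10], [0, 0, 10], [0, 0, 0], radius=1.5)
```

The quadratic cost of the pairwise comparison is one reason why sampling a subset of events is often needed in practice.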
Nevertheless, generating the graphs can be computationally demanding. This is particularly true when dealing with high-density event streams, resulting in a large number of vertices and edges [53]. Consequently, it is often necessary to sample a subset of events from the batch to reduce storage and computational costs [20], [52]. Moreover, unlike CNNs in traditional computer vision, there is limited availability of GCN models pre-trained on large-scale datasets. This hampers the ability to leverage transfer learning. As a result, researchers often develop their own GCN architectures to accommodate the generated graphs [20], [27], [52].

C. AUGMENTATION METHODS FOR EVENT-BASED VISION
Data augmentation techniques play a crucial role in enhancing the performance and generalization of deep learning models. Given the limited availability of labeled event-based datasets, augmentation methods offer an effective approach to expand the training data and improve model robustness. In this subsection, we provide an overview of the different augmentation methods proposed for event data.
Li et al. [54] propose several randomized geometric augmentations for training SNNs. These include common techniques such as horizontal flip, translation, and rotation; as well as other unique techniques such as cutout, shear, and CutMix. These transformations introduce variations and enhance model performance. Gu et al. [55] introduce EventDrop, an augmentation framework for randomly dropping events within an event batch. It explores various event-dropping techniques, including dropping events within a random time period, pixel area, or a random portion of the sampled events. EventDrop improves robustness and has been evaluated for event-based object recognition. The authors also explore the use of EventDrop on different combinations of event representations and pre-trained classification models. EventMix [56] presents an advanced augmentation framework that uses a random 3D mask to mix different event-batch samples and their labels. This mixing technique enhances the diversity of the training data and has been evaluated on a set of event-based recognition benchmarks as well. Naeini et al. [57] propose spatial, noise, and time-series augmentations to improve contact-force estimation. Spatial augmentations include rotations and resizing. Noise augmentations add sequences of noise to the dataset, which are generated by recording similar sequences without any movement. Time-series augmentations include frame-shifting, which shifts all generated batch-representation frames within a given sequence; and temporal event shifting, where a fraction of events are randomly selected and removed from one frame and appended to an adjacent frame. For both types of time-series augmentations, the authors explore a fixed index-shift range of +3 to −3. These augmentation methods, along with others proposed in the literature, contribute to addressing the dataset scarcity issue in event-based vision. By applying these techniques, models can better handle variations in event data and improve their generalization capabilities. However, despite their importance, event data augmentation techniques are still not thoroughly explored in the literature.
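The three event-dropping strategies described above (by random portion, by time period, and by pixel area) can be sketched as follows. This is not the EventDrop implementation from [55]; the event layout (an array with columns x, y, t, p), the function names, and the parameter ranges in `random_drop` are assumptions made for illustration.

```python
import numpy as np


def drop_random(events, ratio, rng):
    """Drop each event independently with probability `ratio`."""
    keep = rng.random(len(events)) >= ratio
    return events[keep]


def drop_by_time(events, t_lo, t_hi):
    """Drop all events whose timestamp falls in [t_lo, t_hi)."""
    t = events[:, 2]
    return events[(t < t_lo) | (t >= t_hi)]


def drop_by_area(events, x0, y0, w, h):
    """Drop all events inside the pixel rectangle starting at (x0, y0)."""
    x, y = events[:, 0], events[:, 1]
    inside = (x >= x0) & (x < x0 + w) & (y >= y0) & (y < y0 + h)
    return events[~inside]


def random_drop(events, rng, height, width):
    """Apply one strategy at random; all ranges below are illustrative."""
    choice = rng.integers(3)
    if choice == 0:
        return drop_random(events, rng.uniform(0.1, 0.5), rng)
    if choice == 1:
        t0, t1 = events[:, 2].min(), events[:, 2].max()
        start = rng.uniform(t0, t1)
        return drop_by_time(events, start, start + 0.2 * (t1 - t0))
    x0, y0 = rng.integers(width), rng.integers(height)
    return drop_by_area(events, x0, y0, width // 4, height // 4)


rng = np.random.default_rng(0)
events = np.array([[0, 0, 0, 1], [1, 1, 50, -1], [2, 2, 90, 1]], dtype=float)
augmented = random_drop(events, rng, height=4, width=4)
```

Operating on the raw event list keeps the augmentation independent of whichever batch representation is built afterwards.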

D. LITERATURE CONTRIBUTION
In this paper, we present the CSTR, an alternative image-like representation for event-based vision. The CSTR offers a comprehensive representation of sparse event data when processed in batches while requiring minimal memory resources. It provides a choice that eliminates the need for manual parameter tuning and can be generated in an online manner. It is important to note that the CSTR is not meant to replace advanced or more sophisticated representations. Rather, it serves as an excellent representation choice for initial proof-of-concept and facilitates the rapid deployment of event-based solutions. This is due to the compact 3-channel image-like format of the CSTR, which enables the direct utilization of state-of-the-art computer vision architectures.
To validate the effectiveness of the CSTR, we conduct several experiments on various event-based recognition benchmarks comparing it to other image-like representations of similar complexity using various pre-trained classification networks. Additionally, we supplement our representation with several randomized augmentation methods that impact different components of events, including spatial, temporal, and polarity. These augmentation techniques further contribute to improving the performance and the generalization capabilities of event-based vision models.

III. METHODOLOGY
In this section, we present our proposed event-based representation. First, we provide a detailed overview of how events are generated. Then, we define the common and foundational image-like representations that form the basis of our work. These representations fundamentally encode the spatial and/or temporal components of events within the event batch. By analyzing the characteristics of these representations, we derive a more advanced spatio-temporal representation that enhances performance. We visualize these representations on the evaluation datasets in Fig. 2. Given that our approach aims to improve temporal context, we also introduce a novel temporal augmentation technique to address the sparseness of training data.

A. EVENT GENERATION MODEL
In contrast to traditional cameras, event-based sensors capture per-pixel brightness changes asynchronously [6]. At a given pixel (x, y), an event e is generated whenever the logarithmic change in brightness intensity exceeds a predefined contrast threshold C. This can be expressed as follows:

$$\left| \log I(x, y, t) - \log I(x, y, t - \Delta t) \right| \geq C, \tag{1}$$

where I(x, y, t) represents the intensity measurement at spatial position (x, y) at time t, and Δt represents the time duration since the last generated event at the same spatial position. The polarity p of an event is determined by the sign of the brightness change. A brightness increase (on event) is assigned p = +1, while a brightness decrease (off event) is assigned p = −1. Thus, p ∈ {+1, −1}. Event-based sensors report each captured event $e_i$ as a combination of a microsecond timestamp $t_i$, a polarity $p_i$, and a two-dimensional spatial coordinate $(x_i, y_i)$. In general, an event stream ε composed of n sequential events can be denoted as:

$$\varepsilon = \{e_i\}_{i=1}^{n}, \quad e_i = (x_i, y_i, t_i, p_i). \tag{2}$$

Events can be grouped into batches either based on a specified batch-sampling period T or a fixed number of events. In this work, we focus on event batches accumulated using predefined batch periods to enable a synchronous response.
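Grouping a stream into batches of period T can be sketched as below. The snippet assumes chronologically sorted integer timestamps (e.g., in microseconds) and measures periods from the first event; the function name and these conventions are assumptions, not details given in the text.

```python
import numpy as np


def batch_by_period(t, period):
    """Split sorted timestamps into index groups, one per batch period.

    Returns a list of index arrays. Periods that contain no events are
    simply skipped (an illustrative choice).
    """
    t = np.asarray(t, dtype=np.int64)
    if t.size == 0:
        return []
    # Integer batch id of each event, counted from the first timestamp.
    batch_id = (t - t[0]) // period
    # A new batch starts wherever the batch id changes.
    bounds = np.flatnonzero(np.diff(batch_id)) + 1
    return np.split(np.arange(t.size), bounds)


# Five events with T = 1000 us fall into three consecutive batches.
batches = batch_by_period([0, 10, 1000, 1500, 2500], period=1000)
```

Each index group can then be used to slice the x, y, t, and p arrays of the stream before building a representation for that batch.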
The event generation process outlined above captures the spatio-temporal dynamics of the scene. This is done by detecting changes in brightness intensity and encoding them as events with corresponding timestamps, spatial coordinates, and polarities.

B. FOUNDATIONAL EVENT REPRESENTATIONS
To represent a batch of events ε captured during a sampling period T, several image-like representations can be formed. We identify five foundational approaches from the literature: Binary Event Frame, Polarized Event Frame, Binary Event Count, Polarized Event Count, and Timestamp Image. While these representations are not typically referred to as Binary or Polarized, we use these terms to distinguish between them clearly. We detail these approaches next.

1) BINARY EVENT FRAME
The Binary Event Frame binarizes whether any events are detected at a given spatial location. Each pixel position in the resulting two-dimensional H × W representation can be encoded as follows:

$$F_{bin}(x, y) = \begin{cases} 1, & \text{if } \exists\, e_i \in \varepsilon : (x_i, y_i) = (x, y), \\ 0, & \text{otherwise,} \end{cases} \tag{3}$$

where $F_{bin}$ denotes the Binary Event Frame, and $x_i$ and $y_i$ are the spatial coordinates of each event $e_i$ in the batch ε. We encode the presence of an event as 1 and the absence of any as 0. This representation is visualized in Fig. 2, column one. Note how this approach is very simplistic and has low contrast; this is because it is highly sensitive to motion-overlapping, where multiple events occur at the same spatial location, as well as noise captured by the event camera. Accordingly, this representation suffers from frame saturation, which occurs under almost any batch-sampling duration, as shown in Fig. 2.
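A minimal NumPy sketch of the Binary Event Frame is shown below. The function name and the row-major `frame[y, x]` indexing convention are assumptions made for illustration.

```python
import numpy as np


def binary_event_frame(x, y, height, width):
    """Mark every pixel that received at least one event with 1."""
    frame = np.zeros((height, width), dtype=np.uint8)
    # Repeated coordinates are harmless here: every write stores the same 1.
    frame[np.asarray(y), np.asarray(x)] = 1
    return frame


# Two events at pixel (x=0, y=1) and one at (x=2, y=0).
frame = binary_event_frame(x=[0, 0, 2], y=[1, 1, 0], height=2, width=3)
```

The two overlapping events produce the same output as a single event would, which is the information loss described above.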

2) POLARIZED EVENT FRAME
The Binary Event Frame can be extended to include polarity information. The Polarized Event Frame incorporates this in a 2 × H × W 3D matrix. The event batches are defined by:

$$F_{pol}(x, y, p) = \begin{cases} 1, & \text{if } \exists\, e_i \in \varepsilon : (x_i, y_i, p_i) = (x, y, p), \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$

where $F_{pol}$ denotes the Polarized Event Frame, $x_i$ and $y_i$ are the spatial coordinates, and $p_i$ is the polarity of each event $e_i$. We similarly encode detected events as 1 and the absence of events as 0, but for each polarity. This representation is visualized in Fig. 2 (second column), showing a notable contrast improvement. Similar to the Binary Event Frame, this representation also suffers from frame saturation. Accordingly, both Event Frame representations are more effective when generating batches based on a constant number of events (ideally a low number) instead of a fixed sampling duration [19].
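The polarized variant can be sketched in the same way, with one channel per polarity. The channel order (index 0 for p = +1, index 1 for p = −1) and the function name are assumptions; the text does not fix them.

```python
import numpy as np


def polarized_event_frame(x, y, p, height, width):
    """Mark pixels that received at least one event, separately per polarity."""
    frame = np.zeros((2, height, width), dtype=np.uint8)
    # Assumed channel order: 0 for positive events, 1 for negative events.
    channel = np.where(np.asarray(p) > 0, 0, 1)
    frame[channel, np.asarray(y), np.asarray(x)] = 1
    return frame


# One positive and one negative event at the same pixel set both channels.
frame = polarized_event_frame(x=[0, 0], y=[0, 0], p=[1, -1], height=1, width=1)
```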

3) BINARY EVENT COUNT
As an alternative to the Binary Event Frame, the Binary Event Count representation captures the number of events at each spatial position. We encode this with the following equation:

$$C_{bin}(x, y) = \sum_{i=1}^{n} [x_i = x,\, y_i = y], \tag{5}$$

where n is the number of events. The Iverson bracket [·] is equal to 1 if the enclosed expression is true, which is whenever an event has the same spatial location as the pixel (x, y), and 0 otherwise. This representation retains more information about the scene at each spatial location. Moreover, as visualized in Fig. 2 (third column), this representation shows high temporal precision, albeit at the cost of less sharp contour details.

4) POLARIZED EVENT COUNT
Analogous to the Polarized Event Frame, the Binary Event Count can be extended to include event-polarity context. We similarly represent this with a 2 × H × W matrix as follows:

$$C_{pol}(x, y, p) = \sum_{i=1}^{n} [x_i = x,\, y_i = y,\, p_i = p], \tag{6}$$

where n is the number of events, $x_i$ and $y_i$ are the spatial coordinates, and $p_i$ is the polarity of each event $e_i$. This is visualized in Fig. 2 (fourth column), improving the contour details (though still not as sharp as the Polarized Event Frame). In contrast to the Event Frame representations, the Binary and Polarized Event Count representations do not suffer from frame saturation. Instead, they are robust to long batch-sampling durations, as shown in Fig. 2. Nevertheless, both Event Count representations require significant motion overlap and high event-density streams to yield a meaningful signal.
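Both Event Count variants can be sketched with a single accumulation. The snippet uses `np.add.at` because a plain fancy-indexed `+=` in NumPy counts a repeated index only once; the function names and the channel order (0 for positive, 1 for negative) are assumptions.

```python
import numpy as np


def polarized_event_count(x, y, p, height, width):
    """Count events per pixel and polarity, returning shape (2, H, W)."""
    count = np.zeros((2, height, width), dtype=np.int64)
    channel = np.where(np.asarray(p) > 0, 0, 1)  # assumed channel order
    # Unbuffered accumulation: every repeated index contributes to the count.
    np.add.at(count, (channel, np.asarray(y), np.asarray(x)), 1)
    return count


def binary_event_count(x, y, p, height, width):
    """Count events per pixel regardless of polarity, returning shape (H, W)."""
    return polarized_event_count(x, y, p, height, width).sum(axis=0)


# Three events at pixel (0, 0) (two positive, one negative), one at (1, 0).
c_pol = polarized_event_count([0, 0, 0, 1], [0, 0, 0, 0], [1, 1, -1, 1], 1, 2)
c_bin = binary_event_count([0, 0, 0, 1], [0, 0, 0, 0], [1, 1, -1, 1], 1, 2)
```

Summing the polarized count over its polarity axis recovers the binary count, so only one pass over the events is needed to obtain both.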

5) TIMESTAMP IMAGE
An alternative approach to tracking the number of events is to identify the most recent timestamp instead. This is achieved using the Timestamp Image representation [42], which is a 3D matrix of size 2 × H × W. Assuming that the batch's events are sorted in chronological order (i.e., from oldest to newest), so that later events overwrite earlier ones at the same position, we obtain this representation as follows:

$$T_s(x_i, y_i, p_i) = \frac{t_i - t_s}{T}, \quad i = 1, \dots, n, \tag{7}$$

where $t_s$ is the raw time offset representing the start of the event batch with temporal duration T, and $t_i$ is the timestamp of the event $e_i$. In (7), $T_s(x, y, p)$ represents the normalized timestamp (in the range of [0, 1]) of the latest event occurring at the pixel location (x, y) and polarity p. The subtraction of $t_s$ removes the time offset from each event's timestamp. This representation is visualized in Fig. 2 (fifth column), where the normalized recent timestamp further improves contour details over the naive Event Frame representations. Note, however, that this improved contrast diminishes under high-density event streams with long batch periods. Additionally, the Timestamp Image is also susceptible to noise in more recent events.
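The Timestamp Image can be sketched as below. Instead of relying on write order, this sketch takes the per-position maximum of the normalized timestamps with `np.maximum.at`, which yields the latest event even if the input is not sorted; this is an implementation choice made here, and the function name and channel order are assumptions.

```python
import numpy as np


def timestamp_image(x, y, t, p, height, width, t_start, period):
    """Store the normalized timestamp of the latest event per pixel and polarity."""
    ts = np.zeros((2, height, width), dtype=np.float64)
    channel = np.where(np.asarray(p) > 0, 0, 1)  # assumed channel order
    # Normalize timestamps to [0, 1] over the batch period.
    t_norm = (np.asarray(t, dtype=np.float64) - t_start) / period
    # The largest normalized timestamp at a position belongs to its latest event.
    np.maximum.at(ts, (channel, np.asarray(y), np.asarray(x)), t_norm)
    return ts


# Two positive events at pixel (0, 0): only the later one (t = 300) is kept.
ts = timestamp_image(x=[0, 0, 1], y=[0, 0, 0], t=[100, 300, 200], p=[1, 1, -1],
                     height=1, width=2, t_start=100, period=400)
```

Positions without events stay at 0, so they are indistinguishable from an event at the very start of the batch; the earlier timestamp at pixel (0, 0) is lost entirely, which is the overwriting issue discussed above.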

6) COMBINING TIMESTAMP IMAGE AND EVENT COUNT
Given the inherent limitations of the Timestamp Image and the Event Count representations, combining them can enhance their robustness [47]. To achieve this, we concatenate the two-channel Timestamp Image $T_s$, defined in (7), with the normalized one-channel Binary Event Count. The normalized Binary Event Count $\hat{C}_{bin}$ is defined as follows:

$$\hat{C}_{bin}(x, y) = \frac{C_{bin}(x, y)}{\max(C_{bin})}, \tag{8}$$

where $\max(C_{bin})$ is the maximum event count in the frame. This combination results in a 3 × H × W 3D matrix, as visualized in Fig. 2 (sixth column). While the addition of the event-count information improves the contour details, the contrast of the recent timestamp channels is still affected by long batch periods with high event density.

C. COMPACT SPATIO-TEMPORAL REPRESENTATION
The combined Timestamp Image and Event Count representation is generally robust but can lose temporal context with motion-overlapping. A recent timestamp is most useful when the event data is temporally sparse; however, it can lose general temporal context when there are many overlapping events. This bias can occur frequently in highly textured scenes or with long batch periods. To address this, we introduce the compact spatio-temporal representation (CSTR).
The CSTR improves the timestamp information by utilizing the mean timestamp instead to better capture temporal context. Thus, we initially accumulate the normalized timestamp values of all events at each spatial position as follows:

$$S(x, y, p) = \sum_{\substack{e_i \in \varepsilon \\ (x_i, y_i, p_i) = (x, y, p)}} \frac{t_i - t_s}{T}, \tag{9}$$

where $S(x, y, p)$ represents the sum of the normalized event timestamps at position $(x, y, p)$. Then, we calculate the mean of the events' timestamps by dividing (9) by the per-polarity event count of (6) as follows:

$$\bar{T}_s(x, y, p) = \frac{S(x, y, p)}{\left|\{e_i \in \varepsilon : (x_i, y_i, p_i) = (x, y, p)\}\right|}, \tag{10}$$

where $\bar{T}_s(x, y, p)$ represents the mean timestamp at position $(x, y, p)$. This is visualized in Fig. 2 (seventh column). Nevertheless, mean timestamps on their own can be insufficient to represent the event data. Incorporating the event count can provide vital event-overlap context. Therefore, we concatenate the 2-channel mean timestamp $\bar{T}_s$, defined in (10), with the normalized Binary Event Count $\hat{C}_{bin}$, defined in (8). This yields a 3-channel representation. We visualize the CSTR in Fig. 2 (last column), showing that it retains strong temporal context and contour sharpness. Hence, the CSTR approach adds robustness to motion-overlapping while retaining direct compatibility with existing computer-vision networks.
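A compact NumPy sketch of the CSTR construction is given below. It is an illustration under stated assumptions rather than the authors' code: the function name `cstr`, the polarity encoding in {0, 1}, and the choice of zero for the mean timestamp at positions without events are all assumptions of this example.

```python
import numpy as np

def cstr(x, y, p, t, t_start, duration, height, width):
    """Return a 3 x H x W array: two mean-timestamp channels (one per polarity)
    followed by the normalized polarity-agnostic event count (hypothetical helper)."""
    x = np.asarray(x, dtype=np.intp)
    y = np.asarray(y, dtype=np.intp)
    p = np.asarray(p, dtype=np.intp)
    t_norm = (np.asarray(t, dtype=np.float64) - t_start) / duration

    # Sum of normalized timestamps per pixel and polarity.
    ts_sum = np.zeros((2, height, width), dtype=np.float64)
    np.add.at(ts_sum, (p, y, x), t_norm)

    # Number of events per pixel and polarity.
    count_pol = np.zeros((2, height, width), dtype=np.float64)
    np.add.at(count_pol, (p, y, x), 1.0)

    # Mean timestamp; positions without events are set to 0 (assumption).
    mean_ts = np.divide(ts_sum, count_pol,
                        out=np.zeros_like(ts_sum), where=count_pol > 0)

    # Polarity-agnostic count, scaled by its maximum over the frame.
    count_bin = count_pol.sum(axis=0)
    peak = count_bin.max()
    count_norm = count_bin / peak if peak > 0 else count_bin

    return np.concatenate([mean_ts, count_norm[None]], axis=0)
```

Each event contributes one update to the running sum and one to the count, and no pixel depends on its neighbors, so the cost of this sketch grows linearly with the number of events in the batch.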

D. EVENT-BASED DATA AUGMENTATION FRAMEWORK
Randomized data augmentations can improve the generalization of deep learning models. Further, they can complement the spatio-temporal representations in event-based solutions. Accordingly, we propose a simple framework for randomized event-data augmentations that affect the spatial, temporal, and polarity information of event data. These augmentations can be combined and applied when training an event-based deep learning model with a spatio-temporal representation.

1) SPATIAL AUGMENTATIONS
Spatial augmentations are a common solution for introducing variations across the spatial dimension. In our framework, we explore a combination of rotations, rescalings, crops, and horizontal flips, each with its own parameters to set. For computational efficiency, we apply spatial augmentations to the generated image-like event-batch representations.

2) TEMPORAL AUGMENTATIONS
Rich temporal information is a major component of event data. Temporal augmentations can help enhance a model's ability to handle temporal dynamics. This is vital for representations that incorporate temporal information (e.g., Timestamp Image [42]). As illustrated in Fig. 3, events are shifted based on a randomized value $\lambda$ within the range of [−1, +1], which is generated once per event-batch sample $\varepsilon$. This randomized but batch-consistent temporal shifting allows the model to learn from different temporal perspectives and improves its robustness to varying temporal dynamics. The temporal shift for each event $e_i$ in the event batch $\varepsilon$ can be expressed as:

$$t'_i = t_i + \lambda \, \theta_t \, T, \tag{11}$$

where $t'_i$ is the shifted timestamp of event $e_i$, $\theta_t$ is the maximum temporal shift threshold ($\theta_t \in (0, 1)$), and $T$ is the batch-sampling period. A balanced value for the maximum temporal shift threshold $\theta_t$ is 0.5, which indicates that the batch's events can be shifted by at most $T/2$ in either direction (shown in Fig. 3). Then, we filter out any events that fall outside the original batch's temporal range of $[0, T]$. Note that the proposed temporal augmentations are applied to a given event batch $\varepsilon$ before generating an image-like representation.
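The temporal shift and the subsequent filtering can be sketched as follows. This is a minimal illustration: the function name is hypothetical, timestamps are assumed to be already relative to the batch start, and drawing λ from a uniform distribution is an assumption, since the text only states that λ is randomized within [−1, +1].

```python
import numpy as np

def temporal_shift(t, duration, theta_t=0.5, rng=None):
    """Shift all timestamps of one batch by a single random offset and
    drop events that leave [0, duration] (hypothetical helper).

    Returns the shifted timestamps of the surviving events and the boolean
    mask that selects them, so x, y, and p can be filtered the same way.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.asarray(t, dtype=np.float64)

    # One lambda per event batch (uniform sampling is an assumption).
    lam = rng.uniform(-1.0, 1.0)
    t_shifted = t + lam * theta_t * duration

    # Keep only events that remain inside the original temporal range.
    keep = (t_shifted >= 0.0) & (t_shifted <= duration)
    return t_shifted[keep], keep
```

The returned mask would be applied to the remaining event fields before building the representation, for example `x[keep]`, `y[keep]`, and `p[keep]`.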

3) POLARITY AUGMENTATIONS
Polarity augmentations introduce variations across the polarity domain, enabling the model to learn from varying polarity correlations of events. In our framework, we adopt a simple approach of inverting all the polarities in an event batch prior to frame transformation. This polarity inversion typically implies the reversal of the direction of motion and can introduce robustness to variations in lighting and motion. Hence, for each event $e_i$ in an event batch $\varepsilon$, the polarity $p_i$ is inverted to $\bar{p}_i$ if the threshold $\theta_p$ is met. The threshold $\theta_p$ is ideally set to 0.5, indicating a 50% chance of inverting the polarities of a given event batch $\varepsilon$. Similar to the proposed temporal augmentation method, the polarity augmentations are applied before generating the image-like representation.
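A minimal sketch of the batch-level polarity inversion is shown below. The function name and the polarity encoding in {0, 1} are assumptions of this example; with a {−1, +1} encoding, the inversion would be a sign flip instead.

```python
import numpy as np

def invert_polarity(p, theta_p=0.5, rng=None):
    """Invert every polarity in the batch with probability theta_p
    (hypothetical helper; polarities are assumed to be 0 or 1)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p)
    # A single draw decides the fate of the whole batch.
    if rng.random() < theta_p:
        return 1 - p
    return p
```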

TABLE 1. Statistics of the event-based object and action recognition datasets used in our experiments. The symbol † indicates that the referenced dataset does not have an official test split, while ‡ denotes that the dataset's original sequences were divided into samples of 500 ms with a 250 ms step size (following [58]).

IV. SETUP
In this section, we evaluate the proposed event-based representation for object and action recognition. Our primary comparison evaluates the proposed event representation, the CSTR, against the foundational representations defined in the methodology (Section III-B). We do this over a series of well-known datasets to demonstrate our improvements in recognition tasks. Next, we take the best-performing spatio-temporal representations and perform a second comparison while employing our proposed augmentation framework. Our experimental setup, including the network structures, datasets, augmentations, and training parameters, is introduced next.

A. EXP I: BASELINE REPRESENTATION EVALUATION
In the baseline experiment, we compare the CSTR against the six foundational event representations presented in Section III-B. Recall that the Event Frame representations are traditionally encoded as either 0 or 1, while the foundational Event Count representations are encoded as the number of events (without scaling). However, the Event Count channel associated with the combined Timestamp Image & Event Count and the CSTR is normalized. This is done by dividing each event-count value by the maximum number of events in the frame as defined in (8). We apply this because the temporal representations are already scaled to be in the [0, 1] range.
We add rigor by exploring three-channel configurations for the one- and two-channel representations. We do this to enable direct compatibility with the classification networks' input structures and better leverage their pre-trained weights. In the case of the one-channel Binary Event Frame and Binary Event Count, we replicate the resulting channel three times. In the case of the Polarized Event Count, Timestamp Image, and the CSTR with mean timestamps only, we append an empty channel of zeros of the same spatial dimensions. Lastly, for the two-channel Polarized Event Frame, we first convert to an intermediary one-channel representation, where positive and negative events are denoted by values of +1 and −1 (following the approach proposed in [18]). We then replicate this three times instead of padding with a channel of zeros. These configurations are determined through experimentation to yield optimal results for each representation.
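The two basic conversions to three channels can be sketched in NumPy as follows. The helper names are hypothetical, and the Polarized Event Frame conversion is left out because its details follow [18] and are not spelled out here.

```python
import numpy as np

def replicate_to_three(frame):
    """Turn a 1 x H x W representation into 3 x H x W by copying the channel."""
    return np.repeat(frame, 3, axis=0)

def pad_to_three(frame):
    """Turn a 2 x H x W representation into 3 x H x W by appending a zero channel."""
    zeros = np.zeros((1,) + frame.shape[1:], dtype=frame.dtype)
    return np.concatenate([frame, zeros], axis=0)
```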

1) EVENT-BASED RECOGNITION DATASETS
Several event-based object and action recognition datasets are available in the literature. In this work, we utilize four commonly used event-based datasets to evaluate our proposed methods for object recognition: N-MNIST [24], N-Cars [25], N-Caltech101 [24], and CIFAR10-DVS [26]. Additionally, we evaluate our methods on two action recognition datasets, namely ASL-DVS [27] and DVS-Gesture [28]. In Table 1, we provide an overview of the main details and statistics of the selected recognition datasets.
For object recognition, all datasets except N-Cars [25] are effectively event-based versions of their frame-based counterparts commonly used in conventional computer vision. These datasets are generated using an event-based sensor, such as the DVS-128 [6] or the ATIS [7], mounted on a platform that moves in parallel to a screen displaying image samples of each dataset. The platform is programmed to move at various velocities and motions to simulate events similar to real-world sensor data. N-Cars [25], on the other hand, was generated using an event camera mounted on a moving vehicle driving on real-world roads. The dataset consists of events captured by the event camera as the vehicle encounters different objects, including cars and pedestrians, in various driving scenarios.
For action recognition, ASL-DVS [27] consists of 24 hand shapes resembling different letters from the American Sign Language. These shapes were recorded in an office environment with constant illumination using DAVIS240c [8]. For each letter, 4200 samples were collected at a sampling duration of 100 ms. Meanwhile, DVS-Gesture [28] consists of 1342 event-data sequence recordings of 11 different gestures. These sequences were captured under three lighting conditions and performed by 29 individuals. Due to the considerable length of the dataset's sequences (∼100 seconds on average), we divide each into shorter samples of a fixed batch-sampling period. Initially, each sequence is split into a subsequence per gesture. Then, the resulting subsequences are further divided into 500 ms samples with a 250 ms step size, following a similar approach used in previous works [20], [51], [58]. The resulting number of samples is presented in Table 1.
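The overlapping sampling of each gesture subsequence can be sketched as below. The function name is hypothetical, and discarding a trailing window that does not fit entirely inside the subsequence is an assumption of this example.

```python
def sliding_windows(t_start, t_end, window=500.0, step=250.0):
    """Return (start, end) pairs, in ms, of all windows of length `window`
    that fit entirely in [t_start, t_end], advancing by `step`."""
    windows = []
    start = t_start
    while start + window <= t_end:
        windows.append((start, start + window))
        start += step
    return windows
```

For a one-second subsequence, `sliding_windows(0.0, 1000.0)` gives three samples that overlap by half: 0 to 500 ms, 250 to 750 ms, and 500 to 1000 ms.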
Except for DVS-Gesture [28], we use the provided samples with pre-defined batch periods T from each dataset, as outlined in Table 1. The sampling periods range from 100 ms (N-Cars [25] and ASL-DVS [27]) to roughly 1300 ms (CIFAR10-DVS [26]). This enables us to analyze the robustness of different event representations to various batch-sampling periods.
Furthermore, Table 1 demonstrates an uneven distribution in the average number of samples per class across the datasets. N-MNIST [24], N-Cars [25], ASL-DVS [27], and DVS-Gesture [28] exhibit a substantial number of samples per class, facilitating effective training and fine-tuning of classifiers. In contrast, CIFAR10-DVS [26] and N-Caltech101 [24] have significantly fewer samples per class on average (1000 and 81, respectively). While the samples of CIFAR10-DVS [26] are uniformly distributed among classes, the samples of N-Caltech101 [24] are highly unbalanced, ranging from 31 to 800 samples per class, posing a challenge for object recognition tasks.
For datasets without an official test split (N-Caltech101 [24], CIFAR10-DVS [26], and ASL-DVS [27]), we adopt the 80%-20% training-testing dataset-split strategy employed in similar works [20], [25], [51]. These splits are generated once and utilized consistently throughout the experiments of this work to ensure consistent benchmarking and fair comparisons. In addition, to address the imbalance in the sample distribution within N-Caltech101 [24], we apply the same split ratios to each class's samples. This approach avoids imbalanced splits and maintains a fair and consistent benchmarking process across the different methods evaluated in this work.
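A per-class split of this kind can be sketched with the standard library alone. The function name, the fixed seed, and the rounding rule for class sizes that do not divide evenly are assumptions of this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into train and test lists while applying the
    same ratio inside every class (hypothetical helper)."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is generated once
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test
```

Because the ratio is applied class by class, a class with 31 samples and a class with 800 samples both contribute roughly 20% of their samples to the test set.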

2) CLASSIFICATION MODELS
We evaluate each event representation using six popular pre-trained CNN image classifiers. We do this both for completeness and to represent real-world use. These classifiers include: ResNet18 [15], ResNet50 [15], MobileNetV2 [59], both Small and Large variants of MobileNetV3 [60], and InceptionV3 [61] (limited to 3-channel representations only). We initialize all networks with weights pre-trained on ImageNet [62]. Then, we replace the final fully connected layer with a corresponding layer that matches the number of output classes in the utilized dataset. For representations with 1 or 2 channels, we replace the initial input convolutional layer of each CNN classifier with a randomly initialized layer that accommodates the desired number of input channels. Subsequently, we fine-tune these networks on the evaluation datasets. Throughout our experiments, we observed that utilizing the frame-based architectures as-is (i.e., for 3-channel representations) yields better results due to more effective fine-tuning. Consequently, whenever possible, we present either a replicated or an extended 3-channel version of all tested representations.

3) TRAINING PARAMETERS
For all models trained in this work, we use the cross-entropy loss with the ADAM [63] optimizer (without weight decay), for up to 50 epochs. We utilize an initial learning rate of $1 \times 10^{-3}$ for N-MNIST [24], N-Cars [25], and ASL-DVS [27]; and $3 \times 10^{-4}$ for the more challenging N-Caltech101 [24], CIFAR10-DVS [26], and DVS-Gesture [28]. While more advanced learning rate schedulers can be employed, we avoid them to limit the number of hyper-parameters and simplify the comparison.
During training, each batch-representation sample is initially generated with a resolution matching the spatial dimensions of the utilized dataset (as shown in Table 1). The resulting 3D representations are then scaled to 224 × 224 for all classifiers, except for InceptionV3 [61], which requires a 3-channel input with the spatial dimensions of 299 × 299. After rescaling, we apply standardization to the resulting 3D matrices using normalization parameters derived from ImageNet [62] (i.e., mean and standard deviation). Our experiments (using the CSTR with the object recognition datasets) consistently show an average classification accuracy improvement of approximately 5% when utilizing ImageNet normalization parameters. This improvement is observed compared to using each dataset's distribution parameters or when not applying normalization. It can be attributed to the suitability of ImageNet parameters for generalizing image-like representations. This is particularly important given the relatively low number of samples of the event-based datasets used in our experiments, compared to ImageNet [62], making them less optimal for removing input bias through standardization.
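The standardization step can be sketched as follows, using the per-channel ImageNet statistics commonly applied with pre-trained classifiers. The function name is hypothetical, and the rescaling step is omitted because the interpolation method is not specified in the text.

```python
import numpy as np

# Widely used ImageNet per-channel statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def standardize(rep):
    """Standardize a 3 x H x W representation with values in [0, 1]
    channel by channel (hypothetical helper)."""
    rep = np.asarray(rep, dtype=np.float32)
    return (rep - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]
```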
Furthermore, we randomly split the training set by 75% for training and 25% for validation. In addition, to ensure proper convergence and robust generalization, the samples of the validation split are randomly selected per each class's number of samples. This ensures a more balanced and representative validation set. For all models trained in the baseline experiment, we use early stopping to prevent overfitting. Specifically, we monitor the validation loss during training, and if it does not improve for 10 consecutive epochs, we stop the training early to avoid further overfitting. Afterward, we choose the model with the lowest validation loss that results during training. We follow the same procedure when not utilizing early stopping as well. Finally, we use a batch size of 64 for all the models we train throughout this work.

TABLE 2. Average test classification accuracy results for the foundational event representations and the CSTR across different recognition datasets. Each result is the average of up to 6 classification models as specified in Section IV-A2. Note that the 1- and 2-channel representations are additionally transformed into 3-channel representations as specified in Section IV-A, and indicated by the *. The best and second-best results are highlighted in bold and underlined, respectively.
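The early-stopping and model-selection rule used during training can be sketched on a recorded list of validation losses. This is an illustrative helper with a hypothetical name; it assumes a non-empty loss history and treats only a strictly lower loss as an improvement.

```python
def early_stopping(val_losses, patience=10):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    best_epoch is the epoch with the lowest validation loss seen before
    stopping; stop_epoch is the last epoch that was actually trained.
    """
    best_epoch, best_loss = 0, float("inf")
    stop_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs.
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch
```

Setting `patience` larger than the number of epochs disables the stopping rule while still selecting the checkpoint with the lowest validation loss.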

B. EXP II: RANDOMIZED EVENT AUGMENTATIONS
With a baseline established, our next experiment aims to leverage the randomized augmentation framework introduced in Section III-D. Augmentations are a popular method for addressing data sparsity as they introduce variance in the spatial, temporal, and/or polarity characteristics. We believe these effects can also be used to further investigate batch-representation stability and explore how well the performance of spatio-temporal representations scales with the proposed randomized event-based augmentation framework.
In this experiment, we explore different settings for each type of randomized augmentation (spatial, temporal, and polarity). For spatial augmentations, we apply crops, rotations, and translations to the generated image-like representations. Initially, we randomly take crops of 90-100% of the spatial frame size with aspect ratios ranging from 3:4 to 4:3. We also apply translations of up to 10% along the x and y axes (up to 5% for N-Cars [25]) and rotations of up to ±10° (up to ±30° for N-MNIST [24]). Additionally, random horizontal flips are used with CIFAR10-DVS [26] (applied prior to the other spatial transformations) with a threshold of 0.5. For both temporal and polarity augmentations, we utilize a balanced value of 0.5 for both the maximum temporal shift threshold $\theta_t$ and the polarity inversion threshold $\theta_p$. We note that all of the proposed randomized augmentations are only applied to the training splits (i.e., excluding validation splits).
Furthermore, we explore different combinations of the proposed augmentation methods. Spatial augmentations can be highly beneficial as spatial dependencies are typically the most informative, especially when identifying the edges or contours of an object. However, when utilizing event data, they require careful manual tuning. On the other hand, the proposed temporal and polarity augmentations have minimal parameters to tune and can naturally complement the training of any event-based solution. Therefore, we focus on the temporal-polarity augmentation combination as an alternative that requires no tuning when using their default threshold values. Finally, for a more comprehensive approach, we explore a combination that incorporates all three event-based augmentation methods.
We perform this experiment only on the spatio-temporal representations presented in this work. This includes the proposed 3-channel variants of the CSTR and the Timestamp Image. These representations are selected because the proposed framework primarily affects the temporal and polarity information of event data, making it best suited to spatio-temporal representations. Additionally, we only utilize the three best classifiers found during the baseline experiment: ResNet18 [15], ResNet50 [15], and InceptionV3 [61]. The ASL-DVS [27] dataset is excluded from this experiment as its performance is already effectively saturated without the use of augmentations. Finally, we provide sufficient training time for convergence by training each model for 50 epochs without early stopping. We use an initial learning rate of $1 \times 10^{-4}$ instead while keeping all the other evaluation parameters identical to the initial experiment.

V. EVALUATION RESULTS
In this section, we present our experimental results. We first do a baseline evaluation of the CSTR and six foundational representations across popular event-based recognition datasets. We then identify the best performers and re-evaluate them when using the proposed augmentation framework. These experiments help show that the proposed CSTR is a robust means of representing event batches, including ones with long temporal durations and high event density. Finally, we present a comparison with other works in the literature.

A. EXP I: BASELINE EVALUATION RESULTS
We present the baseline evaluation results in Table 2. This table shows the average performance of the representations with all six classification networks detailed in the experimental setup (see Section IV-A2). For space reasons, we provide a full breakdown of each network's performance in Table 5 of Appendix A. We note a few basic observations. First, including polarity improves generalization. We see this mainly in the Event Frame representations, as well as the Event Count representations but to a lesser extent. This aligns with the methodology expectations. Second, there is a benefit to maintaining the classification networks' native input structure. In all cases, transforming a one- or two-channel representation into three channels (by either padding or replicating data) consistently improves classification accuracy. This reinforces the value of transfer-learning frame-based networks for event-based applications. Lastly, our representation, the CSTR, has the highest average classification accuracy and is the best overall in four of the six datasets.

TABLE 3. The effects of the proposed event-based augmentation framework on the average test classification performance of the different spatio-temporal representations explored in this work. Each result represents the average classification accuracy of the top three classifiers only (ResNet18, ResNet50, and InceptionV3) due to the complexity of training with augmentations. The first row represents the baseline results obtained without any augmentation, serving as a reference point for each representation. The subsequent rows demonstrate the performance improvements achieved when using the respective augmentation configurations. Notably, only the augmented three-channel representations are considered, as outlined in Section IV-A and indicated by the *. The best-performing baseline representation is indicated by the †, while the representations yielding the best and second-best performance with augmentations are highlighted in bold and underlined, respectively.
The strength of the CSTR is in addressing motion-overlapping. We can see that of the foundational event representations, the simple Binary Event Count is rather robust. This implies that the number of events per batch is strongly correlated with the classification task, where adding polarity helps better describe the type of motion. Intuitively, this implies that better describing the events' temporal distribution should improve performance. While the Timestamp Image does this via recent timestamps, this approach can be biased for longer temporal periods. The CSTR addresses this by representing the aggregate behavior with the mean timestamp and generalizes very well across datasets, including those with long temporal durations and high event density.
We note the results get particularly interesting with the CIFAR10-DVS [26] dataset. In general, all classification networks for all representations notably overfit. This overfitting concern is supported by the simple Binary Event Count having the highest classification accuracy on this dataset, remaining in line with its accuracy on other datasets. We believe this overfitting is partially due to the dataset being generated by repeated back-and-forth motions (frequent direction change), causing very significant motion overlap [26]. Furthermore, the CIFAR10-DVS [26] data collection methodology uses upscaled 32 × 32 RGB images that appear rather blurry [26]. This blurriness reduces the edge features the events depend on and inherently increases sensitivity to sensor noise. With this said, the CSTR still does relatively well, but slightly worse than the Timestamp Image representations. We hypothesize here that the timestamp recency better correlates with back-and-forth motions versus the timestamp mean.
Lastly, we observe that the optimal classification network can vary across representations and datasets. Intuitively, classification network accuracy should correlate with ImageNet accuracy; however, the expanded results given in Appendix A (Table 5) show that this is not always the case. We conjecture that this can be a function of dataset density and intra-class variance. When the variance is particularly high, such as in the CIFAR10-DVS [26] dataset, the smaller networks tend to generalize better. This is likely a result of overfitting, where the smaller parameter spaces inherently regularize themselves. However, we also note the large InceptionV3 [61] network is still the top performer for some representations. This implies picking the optimal network may ultimately require experimentation. We recommend that the developer assess various networks and select the one that best fits their accuracy and run-time requirements.

TABLE 4. Comparison with the self-reported state-of-the-art works. Our proposed representation, the CSTR, yields very competitive results when compared with state-of-the-art event-based object and action recognition on the utilized datasets. For datasets without an official split, the † symbol denotes that the referenced result was based on a 90%-10% split, compared to the typical 80%-20% split. The best and second-best results are highlighted in bold and underlined, respectively.

B. EXP II: RANDOMIZED AUGMENTATIONS RESULTS
We present the results of the augmentation evaluation in Table 3. Starting with the baseline results, we observe that the CSTR consistently outperforms other representations when considering the top-3 classifiers (ResNet18, ResNet50, InceptionV3) on most datasets. This emphasizes the robustness of the CSTR in capturing spatio-temporal information across varying batch periods. The slight underperformance of the CSTR on the N-Cars dataset compared to the Timestamp Image representation can be attributed to the dataset's low event density and short batch periods. This causes larger classification networks to underfit with more complex representations. We observe this with DVS-Gesture as well. Nevertheless, the introduction of the proposed augmentations highlights the limitations of the Timestamp Image. Specifically, the CSTR demonstrates superior results on N-Cars when utilizing either the temporal-polarity augmentation combination or combining all three augmentation methods. This highlights the CSTR's ability to encode spatio-temporal information effectively when provided with sufficient training variations.
Overall, the augmentation framework shows significant performance improvements across all benchmarks. When using a single augmentation method, the proposed temporal augmentation method can match and even exceed the performance of hand-crafted spatial augmentations. This is evident in the highest average performance achieved by a single augmentation method (i.e., 91.2% when using the CSTR). We find that the CSTR benefits the most from the temporal augmentations due to its effectiveness at encoding temporal information. On the other hand, spatial augmentations, while generally reliable, have limitations on datasets with challenging spatial characteristics like N-Caltech101 [24]. Furthermore, spatial augmentations require manual tuning for optimal results. In contrast, the proposed temporal and polarity augmentations serve as a promising alternative, requiring minimal tuning and consistently outperforming spatial augmentations on average across all evaluated representations. This makes them particularly advantageous for optimizing deep learning models in event-based applications.
Interestingly, we find that combining all augmentation methods (spatial, temporal, and polarity) does not consistently yield the best performance. The significant variations introduced by this combination can lead to underfitting, considering the utilized regularization approach. Therefore, we suggest exploring an alternative approach of randomly selecting one of the augmentation methods per event-batch sample during training. Additionally, we observe that spatial augmentations underperform polarity and temporal augmentations on the N-Caltech101 [24] dataset. This can be attributed to the dataset's imbalance, where typical spatial augmentations are insufficient to improve generalization.
In conclusion, our findings demonstrate the strength of the CSTR and its ability to leverage the proposed augmentation framework. The temporal augmentations prove to be the most advantageous on average for the CSTR, showcasing the CSTR's effectiveness in capturing temporal information. Moreover, combining multiple augmentation methods can enhance generalization performance. However, further exploration and optimization of the augmentation methods are necessary to maximize performance and address limitations.

C. COMPARISON WITH THE STATE-OF-THE-ART
In this section, we compare the performance of the CSTR with other approaches that utilize the same recognition datasets. Although each approach utilizes different methods and training configurations, our aim here is to highlight the efficacy of the CSTR when combined with off-the-shelf pre-trained classification networks. Furthermore, we emphasize how the performance can be further improved by leveraging the proposed augmentation framework for event data.
We present the performance comparison in Table 4. While most works report results for an 80-20% split, we provide the results of our framework on a 90-10% split for CIFAR10-DVS [26] as well to establish a fair comparison with those that utilize such a split. For our results on DVS-Gesture [28], we adopt a simple moving-majority filter to handle the long-term temporal dependencies, as applied in [23], [58]. This filter outputs the most frequent gesture classification out of the last 5 classifications (i.e., 1250 ms moving window). If there is more than one gesture with the same number of classifications (or none), the filter simply returns the classification result for the current event batch. It is worth noting that all the referenced works also utilize a 500 ms sampling period for splitting the event sequences of the DVS-Gesture [28] dataset.
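The moving-majority filter can be sketched with the standard library as follows. The class name is hypothetical, and counting the current prediction as one of the last 5 is an assumption of this example.

```python
from collections import Counter, deque

class MovingMajorityFilter:
    """Smooth a stream of per-batch class predictions by majority vote
    over the most recent `window` predictions (hypothetical helper)."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, prediction):
        self.history.append(prediction)
        ranked = Counter(self.history).most_common()
        # On a tie for the top count, fall back to the current prediction.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return prediction
        return ranked[0][0]
```

Feeding the predictions 1, 1, 2, 1, 3, 3, 3 one at a time yields 1, 1, 1, 1, 1, 3, 3: the isolated 2 and the first 3 are voted down, the tie at the sixth step falls back to the current prediction, and the majority takes over afterward.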
Overall, the results show that the CSTR performs excellently across the employed benchmark datasets. In terms of the baseline performance (excluding augmentations), the CSTR notably achieves state-of-the-art results on CIFAR10-DVS [26] and ranks as the second-best on ASL-DVS [27]. This demonstrates the robustness and versatility of the CSTR, which requires minimal configuration and enables a direct and effective deployment for event-based solutions.
To demonstrate the impact of the proposed augmentation framework, we compare the results with other works that incorporate different augmentation techniques for event data. One such work utilizes EventMix [56] augmentations in combination with the Polarized Event Count representation. This work splits the provided batch samples of the N-Caltech101 and CIFAR10-DVS datasets into 10 slices of equal temporal duration. This effectively yields 10 times the original number of samples of each dataset. In contrast, we utilize the provided batch samples of each dataset as-is. Despite this, the CSTR with the randomized Temporal-Polarity augmentations proves to be highly competitive, even without splitting the datasets' samples. Accordingly, the CSTR demonstrates significant robustness to varying batch periods. Furthermore, we show that the CSTR, in combination with the proposed temporal and polarity augmentations, can achieve stronger results on N-Caltech101 [24] even with less training data. Lastly, the addition of the augmentation framework significantly improves the performance of the CSTR, surpassing more advanced representations such as EST [22] with the EventDrop [55] augmentation framework.
Our findings highlight the strength of the CSTR representation when combined with off-the-shelf pre-trained classifiers. They showcase the effectiveness of the CSTR in capturing temporal information and leveraging the robustness of pre-trained networks without any modification to the input layers. Thus, the CSTR retains a compact input dimensionality and effectively leverages transfer learning. Furthermore, the proposed augmentation framework offers a promising alternative for enhancing generalization performance without the need for significant manual tuning. Finally, we note that the results presented utilize a simple training framework. Therefore, various training optimization and batch-sampling techniques can be explored to further improve robustness.

VI. CONCLUSION
In this work, we introduce the compact spatio-temporal representation (CSTR) for event-based vision. When dealing with asynchronous event data, it is common to accumulate events in batches to generate a synchronous response. In order to do so, an intermediate representation is needed, especially when utilizing modern computer vision architectures. Thus, encoding the data into a representation compatible with existing classification networks is crucial for leveraging transfer learning and avoiding the complexity of designing custom deep-learning architectures. Foundational event representations typically encode either the number of events or the most recent event's timestamp per spatial location (based on polarity). These approaches are convenient and relatively robust but can be sensitive to motion-overlapping (common in long sampling durations) and possibly deficient for high event-density streams.
The CSTR improves upon the foundational event representations by better describing the temporal behavior of the asynchronous event data while retaining similar computational complexity. This is done by calculating the average of the normalized timestamps per event polarity, combined with the polarity-agnostic number of events at each spatial index of the frame. In addition, the CSTR imposes minimal processing overhead given that each event is only processed once and that each spatial position is updated independently (i.e., without the need to maintain any spatial dependencies), as indicated in the methodology. Accordingly, the CSTR generates a compact image-like representation that is more robust to high-motion scenes and long temporal durations. We validate this hypothesis through rigorous benchmarking against similar representations.
Combining the CSTR with off-the-shelf pre-trained classifiers demonstrates its ability to effectively leverage the power of transfer learning without modifying the input layers, thereby retaining its compact input dimensionality. We also propose a simple yet effective augmentation framework for event data, significantly improving the performance and generalization capabilities of the CSTR. This framework highlights the potential of augmentations in event-based recognition without the need for extensive manual tuning.
Experimental validation confirms that the CSTR outperforms foundational event representations in popular event-based applications. Benchmarking the CSTR against six foundational representations on six common recognition datasets (using six popular classification networks) consistently shows its superior performance. Additionally, incorporating random augmentations during training, including our proposed temporal augmentation, further enhances results on all representations, with the CSTR generally benefiting the most from the proposed augmentation framework. This overall improvement validates the CSTR's ability to robustly encode temporal information.
The CSTR achieves our goal of providing a robust event-batch representation that is directly compatible with existing computer vision architectures, maintaining similar inference complexity. As a result, the CSTR is an excellent choice for developing event-based solutions. The combination of the CSTR with the proposed augmentation framework further enhances its performance and generalization capabilities, requiring minimal tuning and enabling direct deployment.
While the CSTR excels as a versatile representation, it does not directly address certain prominent challenges in event-based vision, such as sensor noise [43]. To mitigate these issues effectively, additional techniques may be necessary.
Future work involves exploring the use of the CSTR in other perception tasks, such as object detection, and investigating additional optimization techniques to enhance robustness. Additionally, evaluating the suitability of the CSTR for real-time applications, where latency is a primary concern, would be an interesting avenue to explore.

APPENDIX A SUPPLEMENTARY TABLES
In Table 5, we provide a full breakdown of the first experiment's results (presented in Table 2). This experiment rigorously compares the CSTR against different foundational representations using the event-based object and action recognition datasets utilized in this work. Accordingly, the results of each of the six pre-trained classification networks, specified in Section IV-A2, are shown. Additionally, in Table 6, we provide a full breakdown of the second experiment's results (presented in Table 3). This experiment evaluates the effects of the proposed event-based augmentation framework on the CSTR and the other spatio-temporal representations explored in this work. The results are shown for each of the three best classification networks utilized in this work (ResNet18 [15], ResNet50 [15], InceptionV3 [61]).

FIGURE 2. Visualizations of the CSTR as well as the foundational event representations investigated in this work using various object and action recognition datasets. To enable visualization, we normalize the Binary and Polarized Event Count representations. Further, due to the significant event noise present in the N-Caltech101 [24] samples, we amplify the event count channels by a factor of 20 to improve visualization. This is shown in the 3rd row, columns 3, 4, 6, and 8.

FIGURE 3. Illustration of the proposed temporal augmentation method. Spatio-temporal events within a given batch are uniformly time-shifted by a randomized value λ multiplied by T. Events that fall outside the original temporal range [0, T] are subsequently removed. The maximum temporal shift θ_t that is demonstrated here is ±50% of the batch duration T.
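The temporal augmentation illustrated in Fig. 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the function name `temporal_shift`, the event tuple layout `(x, y, t, p)` with timestamps relative to the batch start, and the uniform sampling of λ from [-θ_t, θ_t] are all hypothetical choices made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical event layout: (x, y, timestamp, polarity), with 0 <= timestamp <= T.
Event = Tuple[int, int, float, int]


def temporal_shift(events: Sequence[Event], duration: float,
                   max_shift: float = 0.5,
                   lam: Optional[float] = None,
                   rng: Optional[random.Random] = None) -> List[Event]:
    """Shift every event by lam * duration and drop events outside [0, duration].

    max_shift plays the role of the maximum shift (0.5 means +/-50% of the
    batch duration). If lam is not given, it is drawn uniformly from
    [-max_shift, max_shift]; the uniform distribution is an assumption.
    """
    if lam is None:
        rng = rng or random.Random()
        lam = rng.uniform(-max_shift, max_shift)
    offset = lam * duration

    shifted = []
    for x, y, t, p in events:
        t_new = t + offset  # the same offset is applied to every event
        if 0.0 <= t_new <= duration:
            shifted.append((x, y, t_new, p))
    return shifted
```

Passing `lam` explicitly makes the shift reproducible, which is convenient for inspecting how many events a given shift discards before enabling random sampling during training.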

TABLE 5. A full breakdown of the test classification accuracy results that are presented in Table 2 for the event-based recognition datasets. The results of each evaluated representation configuration are demonstrated for 6 different pre-trained classification networks which are fine-tuned on each dataset. The best values are highlighted in bold per dataset and number of input channels.

TABLE 6. A full breakdown of the test classification accuracy results that are presented in Table 3 for the event-based recognition datasets. The results of each evaluated spatio-temporal representation are demonstrated for the top-3 pre-trained classification networks which are fine-tuned on each dataset. The best values are highlighted in bold per dataset.