An End-to-End Broad Learning System for Event-Based Object Classification

Event cameras are bio-inspired vision sensors that measure brightness changes (each referred to as an 'event') for each pixel independently, instead of capturing brightness images at a fixed rate as conventional cameras do. Asynchronous event data mixed with noise is challenging for event-based vision tasks. In this paper, we propose a broad learning network for object classification using event data. The broad learning network consists of two distinct layers, a feature-node layer and an enhancement-node layer. Unlike convolutional neural networks, the broad learning network can be extended by adding nodes to its layers during training. We design a gradient descent algorithm to train the network parameters, which creates an event-based broad learning network in an end-to-end manner. Our model outperforms state-of-the-art models while remaining small in scale and fast to train. This demonstrates the suitability of event cameras for online training and inference.


I. INTRODUCTION
Object classification is an important and fundamental task in image and video understanding. It has been investigated for decades, with research primarily focused on frame-based vision formats. In pursuit of superior performance, research has established Convolutional Neural Networks (CNNs) [13], [24], [25], [31], [45] as the dominant approach for image-based object classification. However, frame-based imaging is not the only option for computer vision, and this format has a number of shortcomings that present challenges for further development [33]. For example, the limited frame rate, high redundancy between frames, blurriness due to slow shutter adjustment under varying illumination, and high power requirements [15] have all stifled progress within the field.
With the advent of bio-inspired vision sensors, neuromorphic cameras [3], [43], [49] are increasingly being used to address the aforementioned problems. These devices essentially attempt to emulate the functioning of the biological retina, particularly event cameras such as the Asynchronous Time-based Image Sensor (ATIS) [43] and the Dynamic Vision Sensor (DVS) [16]. These technologies are fundamentally different from traditional cameras, which capture frames at fixed intervals. The term 'event' refers to an output spike characterized by a specific spatial location (x, y), timestamp (t), and brightness-change polarity (p), as shown in Fig. 1. This means that the output from an event camera is a stream of asynchronous spikes, triggered by brightness changes sensed by individual pixels. Thanks to their logarithmic sensitivity to illumination changes, event-based cameras also have a much larger dynamic range, exceeding 120 dB [43]. The growth in popularity of these sensors, and their distinct advantages, have enabled researchers to implement event-based vision for a variety of applications, such as object tracking [17], [21], [37], [44], visual odometry [28], [48], [56], and optical flow estimation [2], [57].
Recently, several models have been proposed to utilize event-based data for object classification, e.g., CNNs [27], hierarchical models [20], and Deep Belief Networks (DBNs) [41]. Other methods have begun to explore the integration of dynamical information into recognition using motion-direction-sensitive units [26] or dynamical networks such as echo-state networks [32]. However, these networks remain inferior to their frame-based counterparts because asynchronous event-based data have different characteristics: event-based cameras frequently output only scene noise, while at other times they generate large amounts of information. To address this inconsistent output, frame-based CNN models have been adjusted to cope with asynchronous data. However, the accuracy of event-based object classification algorithms remains far behind frame-based technologies, which may be explained by the lack of effective event representations that accurately depict the spatio-temporal characteristics of event-based data. Operations designed to process high-resolution RGB images may also be unsuitable for asynchronous event-based data. Moreover, network designs specific to event-based object classification are largely absent. Ideally, the network should be able to handle asynchronous event data and be fully trainable on both small and large numbers of samples. Therefore, it is necessary to design a flexible neural network for event-based data that sacrifices neither accuracy nor efficiency.
Here we propose an end-to-end learning framework that addresses both problems. First, we introduce a bio-inspired mechanism, namely Peak-and-Fire Mapping (PFM), to record the peaks of the asynchronous event data stream. An event peak encodes the dynamic information of one pixel and its neighboring cell structure. Compared to previous approaches, PFM is robust to spatio-temporal noise. Second, and significantly, we propose to use a broad learning network to handle the PFM representation. The Broad Learning System (BLS) [9] is a novel structure built on the random vector functional-link neural network (RVFLNN) [42], whose features it inherits, and it can be expanded substantially. Contrary to popular deep neural networks, which suffer from time-consuming learning over excessive parameters, the BLS provides a faster scheme with high accuracy. These characteristics make the BLS very efficient and much less time-consuming for classification and regression problems. Nevertheless, the original BLS cannot be applied to online learning systems directly, because its original learning algorithm is a one-shot scheme based on the ridge-regression approximation of pseudoinverses, which is not compatible with weight updates in the feature or enhancement nodes. Hence, we design an iterative learning algorithm that adjusts the parameters within the nodes of all layers by gradient descent. This step ensures the entire BLS is an end-to-end learning system capable of adapting, unlike any of the previous models.
We designed a broad learning network to deal with event-based data for object classification. The broad network provides an alternative way of learning, distinct from previous deep CNN models. The key contributions of this paper are: 1. BLS integrates asynchronous data into a flexible broad network; 2. A gradient descent algorithm is proposed to train the BLS in an end-to-end manner. This ensures our BLS model is faster than, and outperforms, the state-of-the-art in terms of accuracy. The proposed flexible network structure can be fully trained on large- and small-scale training samples, which demonstrates that event cameras are well suited to near-online training and inference applications.

II. RELATED WORK
In this section, we review the most relevant research into event-based object classification, including event-based feature and object classification, event-based networks, and functional link neural networks.
The majority of past research into event-based vision focuses on detecting and tracking stable features, with applications including simultaneous localization and mapping [28], [44], corner detectors [39], [54], edge and line extraction [47], and event-based flow [12]. Recently, several studies have proposed using standard learning architectures to extract event features for object classification. Lagorce et al. [33] proposed a hierarchical representation based upon time-surface definitions, clustering time surfaces at each layer, with the final layer sending its outputs to a classifier. The main limitation of this method is high latency, due to the time required to compute time surfaces and the high computational cost of the clustering algorithm. A more compact and faster representation than [33] was proposed by Sironi et al. [52]; however, this method and those derived from it focus on extracting accurate hand-designed features from event data. The classifiers or networks were not considered within a broader framework that handles asynchronous data. Therefore, these methods are sensitive to noise and depend strongly on the type of object motion within each scene. Networks used with event-based data originate from training artificial neural networks by reproducing or imitating learning rules observed in biological neural networks [4], [23], [38], [50]. These approaches are similar to frame-based computer vision in that they optimize network weights by minimizing smooth error functions.
The most commonly used architectures for event-based cameras are Spiking Neural Networks (SNNs) [1], [7], [40], [46], [55]. An SNN is a type of artificial neural network based on units that communicate with one another through spikes and perform computation only when needed. However, a major drawback of these models is that they are not differentiable: when multiple processing layers are involved, the training procedure becomes much more complicated than the back-propagation algorithm used in conventional neural networks. Because it is difficult to properly train an SNN with gradient descent, Garrick et al. [20] used predefined Gabor filters as network weights. Others have proposed first training a CNN and then converting its weights into an SNN [7], [46]. Cannici et al. [6] added an attention mechanism to an event-based YOLO structure, namely YOLE, to solve the event-based classification problem. Unfortunately, the network conversion process degraded performance below that of conventional CNNs.
The broad learning network we used belongs to the family of Functional Link Neural Networks (FLNN), proposed by Klassen et al. [29]. The FLNN is a variant of the higher-order neural network without hidden units, and it has since been developed into the random vector functional-link network [9]. Various improvements, models, and successful applications of the FLNN were developed thanks to its universal approximation properties; a comprehensive review of the FLNN can be found in [14]. Chen et al. [8] presented an adaptive implementation of the FLNN architecture together with a supervised learning algorithm named rank-expansion with instant learning. The advantage of this rapid algorithm is that it can learn weights in a one-shot training phase without iteratively updating parameters. In addition, a fast learning algorithm was proposed in [11] to find the optimal weights of flat neural networks (especially the functional-link network); it adopts the linear least-squares method, making it easy to update weights instantly for incremental input patterns and enhancement nodes.
The remainder of this paper is organized as follows. Section III introduces the asynchronous spatio-temporal mapping from the event camera; Section IV describes the broad learning system, where we present an end-to-end network training method to handle event-based data. The experimental results and conclusion are given in Section V and Section VI.

III. ASYNCHRONOUS SPATIO-TEMPORAL MAPPING

A. EVENT STREAM
Consider an event-based sensor with a pixel grid of size M × N; an event is generated when the illumination changes at a pixel location. The ith event is defined as e_i = (x_i, y_i, t_i, p_i), where (x_i, y_i) is the spatial location, t_i is the timestamp, and the polarity p_i ∈ {−1, 1} represents OFF and ON events, respectively. The OFF event means the brightness decreases, and the ON event means it increases. When an object (or the camera) moves, the pixels asynchronously generate events which form a spatio-temporal point cloud representing the object's spatial distribution and dynamical behavior. In a given time interval incr, the event camera triggers a set of events E = {e_i}. Due to their asynchronous nature, events are represented as a set. To use events in combination with a neural network, it is necessary to convert the event set into a grid-like representation X. That is, we should find a mapping M : E → X between the set E and a representation X. Ideally, this mapping should preserve the structure (i.e., spatio-temporal locality) as well as the event information.
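To make the mapping M : E → X concrete, the simplest possible instance is a per-pixel signed polarity count over one interval. The following Python sketch is illustrative only; the function name and the counting scheme are our assumptions, not the PFM mapping proposed later in this paper:

```python
import numpy as np

# A minimal grid mapping M: E -> X, assuming each event is a tuple
# (x, y, t, p) with polarity p in {-1, +1}. Events of one interval
# `incr` are accumulated into a signed count per pixel.
def events_to_count_grid(events, M, N):
    X = np.zeros((M, N))
    for x, y, t, p in events:
        X[x, y] += p          # accumulate signed polarity at each pixel
    return X

# Three events inside one interval: two ON events at (0, 0), one OFF at (1, 2).
events = [(0, 0, 0.001, 1), (0, 0, 0.002, 1), (1, 2, 0.003, -1)]
X = events_to_count_grid(events, 4, 4)
```

This discards the timestamps entirely, which is exactly the kind of information loss that motivates the kernel-based mappings of the next subsection.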

B. EVENT-BASED CONVOLUTION
The event set can be viewed through the lens of spatial signal convolution [6], [34]. To derive a meaningful signal from the event stream, we convolve the event data with a suitable aggregation kernel. The convolved signal is defined as

S(x, y, t) = Σ_i f(x_i, y_i, t_i) k(x − x_i, y − y_i, t − t_i),   (2)

where f(x_i, y_i, t_i) is the event measure and k(x, y, t) is the kernel function. Prior works used several kinds of kernel: the alpha kernel k(x, y, t) = f(x, y) (e t/τ) exp(−t/τ) [34], and the exponential kernel k(x, y, t) = f(x, y) (1/τ) exp(−t/τ), where τ is the decay parameter. In fact, the exponential kernel is also used to construct the Hierarchy of Time-Surfaces [33] and the Histogram of Average Time-Surfaces [52], where events are aggregated into exponential time surfaces.
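As a concrete illustration of the exponential kernel above, the convolved signal can be evaluated per pixel as a simple sum over past events. This sketch assumes a unit event measure f = 1 and illustrative names; it is a naive evaluation, not an efficient implementation:

```python
import numpy as np

# Evaluate the exponential aggregation kernel per pixel at query time t:
# S(x, y) = sum over past events at (x, y) of (1/tau) * exp(-(t - t_i)/tau).
# Assumes events are (x, y, t_i, p) tuples and the event measure f = 1.
def exponential_time_surface(events, t, tau, M, N):
    S = np.zeros((M, N))
    for x, y, ti, p in events:
        if ti <= t:           # only past events contribute
            S[x, y] += (1.0 / tau) * np.exp(-(t - ti) / tau)
    return S
```

The decay parameter tau controls how quickly past activity fades; large tau keeps a long activity history, small tau emphasizes only the most recent events.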
After the kernel convolution, we define a region of convolution to discretize the convolution of Eq. (2) into spatio-temporal coordinates, where the discrete representation is determined by a voxel grid of cells c_k of size R × R. We use x_l ∈ {0, . . . , R − 1}, y_m ∈ {0, . . . , R − 1}, and t_n as the nth incr. Typically, the structure of the events contains information about the object and its movement. The voxel grid keeps track of not only a single event's activity but also each surrounding event, so the discrete voxel-grid convolution provides strong local support for the event description.

C. PEAK-AND-FIRE MAPPING
FIGURE 3. Illustration of the peak-and-fire mechanism. An event e_i is generated after a light change is detected by the grid. The peak is calculated using Eq. (4), and the number of events in the R × R cell within incr is counted against the fire threshold µ_t. A peak higher than the threshold is fired as output, and at the same time the memory cell M_c is updated.

Generally, kernel functions are based on task-dependent heuristics, with no general agreement on which kernel is optimal for maximizing performance. In the event-based object classification task, we design a peak-and-fire mechanism to regulate event data from a discrete representation to a robust representation. We detect peaks of activity inside each event location and fire the peaks as outputs, as shown in Fig. 3. This additional step was inspired by the spiking recognition networks proposed in [33], [52]. We adopted the exponential kernel in our classification task, because the exponential decay in the exponential kernel extends the activity of past events and provides information about activity history. The discrete representation in Eq. (3) can then be defined as

P_{e_i} = Σ_{e_j ∈ c_k, t_j ≤ t_i} exp(−Δt/τ),   Δt = t_i − t_j,   (4)

where P_{e_i} provides a dynamic temporal context at the location l_i, and τ is the decay parameter. Considering the polarity of each event, we only integrate events of the same polarity. A peak in a voxel grid cell is considered valid if its value P_{c_k} is above a level of confidence within the interval incr. We integrate the components of the event peaks in the grid-cell region c_k from Eq. (4), where N_{c_k} is the number of events in c_k. A peak in a cell is considered valid only if the event count is greater than the confidence threshold within incr. The threshold is defined as µ_t = N_{c_k}/size(c_k), where size(c_k) = R × R. If N_{e_i} > µ_t, where N_{e_i} is the number of events at l_i, the peak value in cell c_k is fired as an output. This design means the fired peak value suppresses the influence of noise.
When we use the peak-and-fire mechanism over the incoming interval incr, every incoming event e_i needs to be iterated over all past events in a cell. Looping through the entire ordered event stream is extremely expensive and inefficient. We therefore define a shared memory cell M_c for every cell c_k, as shown in Fig. 3, in which the past events relevant to c_k are stored. When a new event arrives in c_k, we update the output of c_k via Eq. (6) by looping only through M_{c_k}, which contains just the relevant past events needed to compute the peak. After each interval, the output is X = [X_{c_1}, . . . , X_{c_K}]. Hence, we compute the robust feature representation without a significant increase in memory requirements.
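The per-cell bookkeeping above can be sketched as follows. This is a simplified illustration under our own assumptions: one cell tracks only event timestamps in its memory M_c, the peak is the exponentially decayed sum over stored events plus the new event, and the threshold is µ_t = N_{c_k}/(R × R) as defined in the text. It abbreviates the exact peak and output definitions of Eqs. (4)-(6):

```python
import numpy as np
from collections import deque

# Illustrative per-cell peak-and-fire: each R x R cell keeps its recent
# event timestamps in a memory M_c; a new event's peak (decayed sum over
# stored events) is fired as output only when the event count exceeds
# the confidence threshold mu_t = N_ck / (R * R).
class PeakAndFireCell:
    def __init__(self, R, tau):
        self.R, self.tau = R, tau
        self.memory = deque()          # M_c: timestamps of relevant past events

    def process(self, t_new):
        # peak: the new event plus decayed contributions of stored events
        peak = 1.0 + sum(np.exp(-(t_new - t) / self.tau) for t in self.memory)
        self.memory.append(t_new)      # update the memory cell
        n_events = len(self.memory)
        mu_t = n_events / (self.R * self.R)   # confidence threshold
        return peak if n_events > mu_t else 0.0
```

A production version would also evict stale timestamps from the memory at the end of each interval, which keeps the per-event cost bounded by the memory size rather than the whole stream length.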

IV. END-TO-END BROAD LEARNING SYSTEM
In this section, we input the PFM result to a broad learning network, which is in the form of a flat network. Then, we propose extending the basic broad learning network by incrementally adding more nodes in a broad way. Finally, a low-rank orthogonal approximation is introduced to decrease the network redundancy during the broad extension phase.

A. REVIEW OF BROAD LEARNING SYSTEM
To build our incremental learning framework, we start with a traditional (i.e., non-incremental) BLS for classification. Our model is based on the BLS [10], which provides an effective and efficient learning framework for classification and regression problems, given the training data set {X, Y}. Here, X denotes the event data representation over a time interval incr described in Sec. III. In the BLS, the training samples are first transformed into n random feature spaces by feature mappings φ_i as

Z_i = φ_i(X W_{f_i} + β_{f_i}), i = 1, . . . , n,   (7)

where the weights W_{f_i} and the bias terms β_{f_i} are generated randomly with the proper dimensions. Then, we define the feature space of the training samples as Z^n = [Z_1, Z_2, . . . , Z_n], a collection of n groups of feature nodes. The outputs of the jth group of enhancement nodes are defined by

H_j = ξ_j(Z^n W_{h_j} + β_{h_j}), j = 1, . . . , m,   (8)

where ξ_j is a nonlinear activation function. We denote the outputs of the enhancement layer by H^m = [H_1, H_2, . . . , H_m]. Therefore, the output Ŷ of a BLS has the following form:

Ŷ = [Z^n, H^m] W = A W,   (9)

where A = [Z^n, H^m] denotes the transformation features, and W is the output weight connecting the feature nodes and enhancement nodes to the output layer. W is optimized by solving the minimization

min_W ‖A W − Y‖^2 + λ‖W‖^2,   (10)

where λ is a small trade-off regularization parameter. The first term denotes the training errors, and the second term controls the complexity of the network structure and improves generality. Then, by setting the derivative of Eq. (10) to zero, we obtain the output weight as

W = (λI + A^T A)^{-1} A^T Y.   (11)

The calculation of the BLS weights W can always be achieved, since the matrix (A^T A + λI) is generally nonsingular. Specifically, we have the ridge-regression approximation of the pseudoinverse,

A^+ = lim_{λ→0} (λI + A^T A)^{-1} A^T.   (12)

Once the training process is complete, the randomly generated weights W_{f_i}, W_{h_j} and the learnt output weight W are fixed. Then, the BLS responds to a testing sample x_t as Ŷ_t = A_t W.
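The classical one-shot BLS above can be sketched in a few lines of NumPy. Layer sizes, the tanh activation for both φ and ξ, and all names are illustrative assumptions, not the configuration used in the experiments:

```python
import numpy as np

# One-shot BLS sketch: random feature nodes Z = phi(X Wf + bf),
# enhancement nodes H = xi(Z Wh + bh), A = [Z | H], and ridge-regression
# output weights W = (lam*I + A^T A)^{-1} A^T Y, as in Eqs. (7)-(11).
rng = np.random.default_rng(0)

def bls_fit(X, Y, n_feat=20, n_enh=40, lam=1e-3):
    d = X.shape[1]
    Wf, bf = rng.standard_normal((d, n_feat)), rng.standard_normal(n_feat)
    Z = np.tanh(X @ Wf + bf)                 # feature nodes (phi)
    Wh, bh = rng.standard_normal((n_feat, n_enh)), rng.standard_normal(n_enh)
    H = np.tanh(Z @ Wh + bh)                 # enhancement nodes (xi)
    A = np.hstack([Z, H])
    # ridge-regression solution for the output weights
    W = np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ Y)
    return (Wf, bf, Wh, bh, W)

def bls_predict(params, X):
    Wf, bf, Wh, bh, W = params
    Z = np.tanh(X @ Wf + bf)
    H = np.tanh(Z @ Wh + bh)
    return np.hstack([Z, H]) @ W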

B. BLS ARCHITECTURE
The BLS is constructed based on flattened functional-link networks; see Fig. 4 for an outline of the entire network structure. The network consists of feature nodes, enhancement nodes, and output nodes. Initially, event-based data are computed by PFM as the input of the network and connected as feature nodes Z_i by the weight W_f and bias β_f. The enhancement nodes H_i are generated from the feature nodes through an activation function, according to the weight W_h and bias β_h. The output nodes are connected with the feature nodes and enhancement nodes directly by W and β. So far, we have presented a general network architecture of the BLS; however, the selection of functions for the feature mapping requires further attention. The functions φ(·) and ξ(·) have no explicit restrictions, which means that common choices such as kernel mappings, nonlinear transformations, or convolutional functions are acceptable. Specifically, if we use convolutional functions for the feature mapping, the BLS network structure is very similar to that of a classical CNN, except that the BLS has additional connecting links between the convolutional layers and the output layer.

C. END-TO-END LEARNING FOR BLS
In the BLS architecture, six parameters need to be trained: W, β, W_h, β_h, W_f, and β_f. In the original BLS model [9], the output weights W are computed through a ridge-regression approximation using Eq. (12), solved by a pseudoinverse computation during training. To extract sparse features from the given training data X, the weights W_h and W_f can be calculated by a sparse autoencoder solved with the Alternating Direction Method of Multipliers (ADMM) [22]; otherwise, W_h and W_f are randomly generated and are not involved in the training process. In other words, only the weight W is trained during incremental training, so the original BLS is not an end-to-end learning system and its parameter solution is fragile: once the number of samples or categories increases, the feature representation weakens because W_f, β_f, W_h, and β_h are never updated through iterative training.
In contrast to previous research, we propose using the gradient-descent learning method to update all parameters iteratively, which makes the event-based BLS an end-to-end learning system. Suppose that there are n mapping groups with K_i feature nodes in the ith group and one group of m enhancement nodes. Given an input sample x = (x_1, x_2, ..., x_M) and the desired output y, we denote by w^i_{f_kl} the weight connecting the lth input x_l to the kth feature node in the ith mapping group, b^i_{f_k} the bias term associated with the kth feature node in the ith mapping group, w^i_{jk} the weight connecting the kth feature node of the ith mapping group to the jth enhancement node, and b_j the bias term associated with the jth enhancement node. Then the qth BLS output, denoted ŷ_q, is

ŷ_q = Σ_{i=1}^{n} Σ_{k=1}^{K_i} w^{iq}_k z^i_k + Σ_{j=1}^{m} w^q_j h_j,   (14)

where z^i_k = σ(Σ_{l=1}^{M} w^i_{f_kl} x_l + b^i_{f_k}) denotes the output of the kth feature node in the ith mapping group, h_j is the output of the jth enhancement node, and σ(·) is the activation function. We set σ(x) = 1/(1 + e^{−x}), and it is trivial to obtain σ' = σ(1 − σ). After the softmax, we obtain

ŷ_q ← exp(ŷ_q) / Σ_{q'=1}^{c} exp(ŷ_{q'}),

where c is the number of categories and ŷ_q represents the score for a particular class. We define the weight matrix connecting the outputs of the feature nodes and enhancement nodes to the output neurons (i.e., the weights in the top layer) as W, where w^{iq}_k is the weight connecting the kth feature node in the ith mapping group to the qth output neuron, and w^q_j is the weight connecting the jth enhancement node to the qth output neuron. The error function between the actual output y and the model output ŷ is defined over the one-hot vector y. It then becomes easy to deduce the following derivatives for the parameters W, β, W_f, β_f, W_h, and β_h. The parameter W has two parts, w^{iq}_k and w^q_j, as described above. For simplicity, we replace −ŷ·(1 − ŷ) with φ in the following equations.
The derivatives of w^{iq}_k and w^q_j follow directly from Eq. (14). The next step is to deduce the derivatives of w^i_{jk} and b_j; using the chain rule of derivatives, they can be obtained through Eq. (14). Finally, applying the chain rule one layer further yields the derivatives of w^i_{f_kl} and b^i_{f_k}, so that every parameter of the BLS can be updated by gradient descent.
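The full backward pass can be sketched as one training step in NumPy. This is a hedged illustration under our own simplifications: one feature group, one enhancement group, the logistic activation σ(x) = 1/(1 + e^{−x}) from the text, and a softmax cross-entropy error in place of the paper's exact error term; all names are illustrative:

```python
import numpy as np

# One end-to-end gradient-descent step updating ALL parameter groups
# (Wf, bf, Wh, bh, W), not just the output weights W.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, y, Wf, bf, Wh, bh, W, lr=0.1):
    # forward: feature nodes, enhancement nodes, softmax output
    z = sigmoid(x @ Wf + bf)
    h = sigmoid(z @ Wh + bh)
    a = np.hstack([z, h])
    logits = a @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    y_hat = e / e.sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(y * np.log(y_hat + 1e-12), axis=1))
    # backward: chain rule through all layers (cf. Sec. IV-C)
    d_logits = (y_hat - y) / len(x)
    dW = a.T @ d_logits
    da = d_logits @ W.T
    dh = da[:, z.shape[1]:] * h * (1 - h)        # sigma' = sigma(1 - sigma)
    dWh, dbh = z.T @ dh, dh.sum(0)
    dz = (da[:, :z.shape[1]] + dh @ Wh.T) * z * (1 - z)
    dWf, dbf = x.T @ dz, dz.sum(0)
    # gradient-descent updates for every parameter group
    return loss, (Wf - lr * dWf, bf - lr * dbf,
                  Wh - lr * dWh, bh - lr * dbh, W - lr * dW)
```

Iterating this step drives the loss down while also adapting the feature and enhancement weights, which is precisely what the one-shot ridge-regression solution cannot do.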

D. INCREMENTAL LEARNING
In various applications, the original BLS network may be too limited for the learning task at hand. This may be caused by insufficient feature and enhancement nodes, rendering the system incapable of extracting enough of the underlying variation factors that define the structure of the input data. In popular deep networks, when the existing model is incapable of learning the task, the general practice is either to increase the number of filters (or windows) or to increase the number of layers. However, these adaptations require tedious retraining, with parameters reset for each new structure. Instead, in the proposed BLS, if new feature and enhancement nodes are required, the whole structure can easily be extended and incremental learning can take place without resetting and retraining the entire network.
Here, let us consider the incremental learning of newly added feature and enhancement nodes. Assume that the initial structure consists of n groups of feature-mapping nodes and m groups of broad enhancement nodes. Suppose the (n + 1)th feature group is added and denoted as

Z_{n+1} = φ(X W_{e_{n+1}} + β_{e_{n+1}}),

where W_{e_{n+1}} and β_{e_{n+1}} are the new weight and bias, respectively. The corresponding enhancement nodes are randomly generated as

H_{ex_m} = ξ(Z_{n+1} W_{ex_i} + β_{ex_i}),

where W_{ex_i} and β_{ex_i} are randomly generated. Denote A^m_{n+1} = [A^m_n | Z_{n+1} | H_{ex_m}], which is the upgrade with the new features and the corresponding enhancement nodes.
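The broad expansion above can be sketched as follows. For simplicity this sketch re-solves the output weights by ridge regression after appending the new columns (the original BLS instead updates the pseudoinverse incrementally); sizes and names are illustrative:

```python
import numpy as np

# Broad expansion: given existing A = [Z^n | H^m], append a new feature
# group Z_{n+1} = phi(X We + be) and its enhancement nodes
# H_ex = xi(Z_{n+1} Wex + bex), giving A_{n+1} = [A | Z_{n+1} | H_ex].
rng = np.random.default_rng(1)

def add_nodes(A, X, Y, n_new_feat=10, n_new_enh=20, lam=1e-3):
    d = X.shape[1]
    We, be = rng.standard_normal((d, n_new_feat)), rng.standard_normal(n_new_feat)
    Z_new = np.tanh(X @ We + be)                      # new feature group
    Wex, bex = rng.standard_normal((n_new_feat, n_new_enh)), rng.standard_normal(n_new_enh)
    H_ex = np.tanh(Z_new @ Wex + bex)                 # its enhancement nodes
    A_new = np.hstack([A, Z_new, H_ex])
    # re-solve output weights over the enlarged transformation features
    W = np.linalg.solve(lam * np.eye(A_new.shape[1]) + A_new.T @ A_new,
                        A_new.T @ Y)
    return A_new, W
```

The existing columns of A are untouched, which is why expansion does not require resetting or retraining the rest of the network.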

V. EXPERIMENTS

A. DATASETS
We validated our approach using five different datasets. Four were generated by converting standard frame-based datasets to events (i.e., the N-MNIST [19], N-Caltech101 [19], MNIST-DVS [49] and CIFAR10-DVS [36] datasets), and one is a novel dataset recorded from real-world scenes, i.e., the N-CARS dataset [52]. N-MNIST, N-Caltech101, MNIST-DVS, and CIFAR10-DVS are four publicly available datasets created by converting the popular frame-based MNIST [13], Caltech101 [35] and CIFAR10 [30] to an event-based representation. N-MNIST and N-Caltech101 were obtained by displaying each sample image on an LCD monitor while an ATIS sensor moved in front of it [19]. Similarly, the MNIST-DVS and CIFAR10-DVS datasets were created by displaying a moving image on a monitor and recording it with the ATIS camera [49]. The N-CARS dataset is split into 7940 car and 7482 background training samples, and 4396 car and 4211 background testing samples; the duration of each sample is 100 ms. MNIST-DVS contains 10000 samples, generated at three different resolutions, scale4, scale8, and scale16; we used 90% of the scale4 samples for training and 10% for testing, and the duration of each sample is approximately 2.3 s. N-Caltech101 consists of 100 different object classes and a background class; each category has between 31 and 800 images, and the duration is approximately 300 ms. In our experiments, 20% of the data was randomly selected for testing and the remainder was used for training. Since the sample durations are not identical, we extracted a single 100 ms window of events as input to our object classification framework; that is, we used incr = 100 ms for all samples.

B. NETWORK SETTINGS
In the PFM formulation, the kernel's length is R = 7 and the memory cell's length is also set to R, which makes the peak value easy to calculate at the same size. τ is set to 10^6 in Eq. (4). For the N-MNIST dataset, we set the numbers of feature nodes and enhancement nodes to 100 and 11000 respectively, denoted as (100, 11000). For the N-Caltech101, MNIST-DVS, CIFAR10-DVS, and N-CARS datasets, the networks are set to (100, 11000), (100, 5000), (60, 3000), and (100, 3000), respectively. In training, we set batch_size = 200. The learning rate is initialized at 10^−3 and decayed by a factor of 0.1 every 10k iterations. Alternatively, we can begin the BLS network from a basic structure, such as 60 feature nodes and 1000 enhancement nodes: as more event-based samples arrive, we can add feature nodes and enhancement nodes through incremental learning to reach reasonable network configurations, as investigated in Sec. V-E.

C. COMPARISON WITH STATE-OF-THE-ARTS
We considered several state-of-the-art methods, including H-First [20], HOTS [33], HATS [52], and SNN [1], [40]. For H-First, we used the code provided by the authors online. For HOTS [33] and HATS [52], we used the results reported in [52]; these methods used a linear SVM for classification. Given that no code was available for the SNN, we compared our results with a two-layer SNN architecture using predefined Gabor filters [5], whose result is reported in [52]. For a fair comparison, we also evaluated our asynchronous spatio-temporal mapping method with the original BLS model, whose results were reported in [18]; the original BLS model was solved using a ridge-regression approximation, and the output weights W were computed through a pseudoinverse computation during training. Our full model adopts the end-to-end learning method based on the gradient descent algorithm described in Sec. IV-C.
The results for the N-MNIST, N-Caltech101, MNIST-DVS, CIFAR10-DVS, and N-CARS datasets are provided in Table 1. We report the results in terms of classification accuracy on the different datasets against the ground truth. Our method has the highest classification accuracy on the N-Caltech101, MNIST-DVS, CIFAR10-DVS, and N-CARS datasets. In particular, our method achieves a large-margin improvement on the more challenging datasets (i.e., N-Caltech101, CIFAR10-DVS, and N-CARS). From the results, we found that on large datasets such as N-CARS, both the H-First and HOTS learning algorithms struggle to converge to a good feature representation. As an event-based feature extraction method, HATS delivers a competitive performance, perhaps because it also adopts an exponential kernel and implements spatio-temporal regularization. However, its linear SVM classifier limits the capacity to discriminate as the number of object classes and the data noise increase. We also observed that on the N-MNIST and MNIST-DVS datasets, our model and HATS provide an equivalent performance, although our method outperforms HATS by approximately 8%, 6.4%, and 3.7% on the N-Caltech101, CIFAR10-DVS, and N-CARS datasets, respectively.

D. COMPARISON WITH CNN-BASED METHODS
We also compared our method with the CNN models.
In the experiments, we chose LeNet5 [13], AlexNet [31], VGG-19 [51], Inception-V4 [53] and ResNet-34 [24] as the baseline CNN-based networks. Given that these CNN models require frame-based input, we grouped spike events into frame form over a random 30 ms time segment. We also supplemented training with heavy data augmentation to avoid over-fitting: we resized input images such that the smaller side was 256 while keeping the aspect ratio, then randomly cropped, flipped and normalized 224 × 224 spatial samples of each resized frame. All CNN-based models were trained from scratch using stochastic gradient descent with momentum set to 0.9 and L2 regularization set to 0.1 × 10^−4; the learning rate was initialized at 10^−3 and decayed by a factor of 0.1 every 10k iterations. Table 2 provides a summary of the results for the CNN-based methods. Our proposed method outperforms all the other methods, and ResNet-34 has the best performance among the compared CNN-based methods. Though our model and ResNet-34 achieve similar results on the N-MNIST and MNIST-DVS datasets, our model outperforms ResNet-34 by 3.1%, 2.5%, and 3.0% on the N-Caltech101, CIFAR10-DVS, and N-CARS datasets, respectively. In particular, we saw that our representation is more suited to object classification than existing handcrafted features, such as HATS and HOTS, even when those features are combined with more complex classification models. This is likely because HATS discards temporal information, which, as we established, plays an important role in object classification. It is also important to note that, compared to the state-of-the-art, our method can still work at high frame rates, which is sufficient for high-speed applications.

E. ABLATION STUDY
1) INCREMENTAL LEARNING STUDY
Table 5 presents the performance of the one-shot broad learning network and three different dynamic incremental constructions on the N-Caltech101 dataset.
Here, we tested the classification ability under different incremental strategies. The one-shot network was set to 10 × 10 feature nodes and 5000 enhancement nodes; it did not use the incremental strategy in each update, but rather trained a new broad learning network from scratch, with the number of enhancement nodes increased by 2000. The first incremental network was set to 10 × 6 feature nodes and 5000 enhancement nodes; the feature nodes were then increased dynamically by 10 at each update. The second incremental network was set to 10 × 10 feature nodes and 1000 enhancement nodes; the enhancement nodes were increased by 1000 at each update. The third incremental network was set to 10 × 6 feature nodes and 1000 enhancement nodes; the feature nodes were dynamically increased from 60 to 100 in steps of 10 at each update, and the corresponding enhancement nodes for the additional features were increased by 1000 each. The results of each update are provided in Table 5. We observed that the incremental versions perform similarly to the one-shot construction; the performance of the first and third incremental strategies was slightly lower than that of the one-shot network.

2) PARAMETER R AND incr STUDY
We tested the influence of the kernel length R and the time interval incr on classification accuracy. Fig. 5 visualizes the PFM outputs X under different incr and R settings. When R is small, the local neighboring information cannot be fully utilized: the edges are obscure and dim, appearing as dotted lines. When R is large, however, the influence of noise may be amplified in the texture and background. The influence of R on classification accuracy is shown in Table 3. For each sample, events within a fixed time interval were randomly extracted as input to our object classification framework. We tested various time intervals (50, 100, and 200 milliseconds) to measure their effect on accuracy and computation; the results are shown in Fig. 5. The model achieves the highest accuracy when extracting 100 ms of events from each sample, so we adopted this setting in our report. In practice, the optimal R and incr configurations differ across datasets; for ease of application, we used R = 7 and incr = 100 ms for all datasets.
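As an illustration of the role of R, the following sketch builds an exponentially decayed time surface and averages it over an R × R neighborhood: a small R under-uses neighboring pixels, while a large R spreads each event's (and each noise event's) influence more widely, mirroring the behavior described above. The exact PFM definition is in Sec. III-B, so the decay constant and box averaging here are stand-in assumptions:

```python
import numpy as np

def time_surface(last_ts, t_ref, R=7, tau=50.0):
    """Exponentially decayed time surface smoothed over an R x R window.

    last_ts : (H, W) array holding each pixel's most recent event
              timestamp (-inf where no event has occurred yet).
    t_ref   : reference time; recent events decay toward 1, old ones to 0.
    """
    surface = np.exp(-(t_ref - last_ts) / tau)
    pad = R // 2
    padded = np.pad(surface, pad, mode="constant")
    out = np.zeros_like(surface)
    H, W = surface.shape
    for i in range(H):
        for j in range(W):
            # average the R x R neighborhood centered on (i, j)
            out[i, j] = padded[i:i + R, j:j + R].mean()
    return out
```

With R = 1 the surface is the raw per-pixel decay; increasing R trades edge sharpness for neighborhood support, which is the tension visible in Fig. 5.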

3) KERNEL STUDY
In the experiments, we found that the kernel function plays an important role in the classification system, likely because the kernel decides which kind of feature information is extracted from the event stream. Table 4 shows the results of different kernel settings. The compared kernel functions include Alpha, Trilinear, and Exponential, which are described in Sec. III-B. For a fair comparison, we did not add the peak-and-fire mechanism to the three kernel models, whereas our model in Table 4 adopted the complete PFM method. Since our model is itself a variation of the exponential kernel, the fact that our full model outperforms the exponential kernel model without the peak-and-fire mechanism shows that this mechanism is able to exclude the influence of noise in the event stream; the advantage is clearest on the N-CARS and N-Caltech101 datasets. We can also see that the differences among the three kernels are small, because all three kernels encode temporal information, which is key to the event-based classification task.

4) COMPLEXITY ANALYSIS
Table 6 shows the time comparison between our method and the CNN models, i.e., LeNet5 [13], AlexNet [31], VGG-19 [51], Inception-V4 [53], and ResNet-34 [24]. We analyzed the model complexity of the broad learning system using two main metrics: the number of FLOPs and the number of parameters, where C_in denotes the input dimensionality and C_out the output dimensionality of a layer. For a fully connected layer, both the number of parameters and the number of (multiply-accumulate) FLOPs are (C_in + 1)C_out. Table 6 summarizes the comparison of network complexity. As one can see, our proposed model has fewer weights and requires less computation than the deep CNN models. The main reason is that our model has only a two-layer representation, which is relatively compact; moreover, there is no convolution layer in our model, so it requires fewer computational resources.
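Under the (C_in + 1)C_out rule above, the cost of a broad network's output layer, where the feature and enhancement nodes are concatenated into one wide input, can be tallied in a few lines. The layer sizes in the usage note are made-up examples, not the paper's reported configuration:

```python
def fc_cost(c_in, c_out):
    """Parameters (and MAC-counted FLOPs) of a fully connected layer,
    including the bias term: (C_in + 1) * C_out."""
    return (c_in + 1) * c_out

def broad_output_cost(n_feature, n_enhance, n_classes):
    """Output layer of a broad network: feature and enhancement nodes
    are concatenated, so C_in = n_feature + n_enhance."""
    return fc_cost(n_feature + n_enhance, n_classes)
```

For instance, a hypothetical network with 100 feature nodes, 8000 enhancement nodes, and 10 output classes has an output layer of (8100 + 1) × 10 = 81,010 parameters, still far below a single convolutional stage of the deep baselines.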
For example, if our model is constructed with 10 × 10 feature nodes and 8000 enhancement nodes, the number of FLOPs is 0.01 GMac and the number of parameters is 5.35K.

VI. CONCLUSION
In this study, we proposed an end-to-end broad learning network for event-based object classification. First, the peak-and-fire mechanism was adopted to map asynchronous events into input data for the broad learning network. We then designed a flexible broad learning network consisting of feature and enhancement nodes to handle asynchronous event data, thereby providing an alternative strategy, beyond CNN and SNN models, for dealing with data from event-based cameras. The incremental learning strategy of extending the broad learning network by adding feature and enhancement nodes during training is efficient and requires no preliminary training. In the experiments, our model outperformed state-of-the-art methods, and its complexity and parameter count are relatively small, which provides distinct advantages over other models; nevertheless, several issues emerged that must be addressed. In the future, we plan to extend the proposed object classification method with a peak-region detection method that incorporates peak-cell detection. It is hoped this next logical step will improve object localization on in-the-wild event data.