Event Augmentation for Contact Force Measurements

Neuromorphic vision sensors are an attractive technology that offers high dynamic range and low latency, both of which are crucial in robotic applications. However, the lack of event-based data in this field limits the sensors' performance in real-world environments. In this paper, we propose a novel augmentation technique for neuromorphic vision sensors to improve contact force measurements from events. The proposed method, "Temporal Event Shifting", shifts a proportion of events across the time domain to augment the dataset. A new set of grasping experiments is performed to validate and analyze the effectiveness of the proposed augmentation method for contact force measurements. The results indicate that temporal event shifting is a highly effective augmentation method that improves the models' accuracy for contact force estimation by thirty percent without performing new experiments.


I. INTRODUCTION
Vision-based tactile sensors are a category of optical sensors that acquire tactile information by utilizing a camera [1], [2], [3], [4]. The camera is mounted on the robotic hand to capture images of the object's contact area, which are then processed to measure contact force, estimate force distribution, and predict object slippage. A wide range of sensors and robotic fingertips have been designed to deal with various applications. Since the sensors have different physical properties, the data captured by one sensor cannot be used for other sensors. Therefore, the datasets in this field are often small and application specific. Moreover, data collection is time-consuming and costly. Therefore, alternative approaches such as simulation and synthetic data generation are studied. For example, simulation techniques have been adapted to increase the volume and diversity of datasets for training deep learning models [5], where the position and texture of the object were randomized. Sim-to-real techniques, in turn, aim to transfer learning from simulations and adapt the model to the real environment [6]. However, less attention is paid to augmentation techniques for vision-based tactile sensors.
The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.
Ordinary image sensors capture the light intensity values of each pixel at a given frame rate, normally within the range of 25-120Hz. Cameras with higher frame rates of up to 12kHz are available, but they are expensive and their dynamic range may be reduced. In contrast, neuromorphic vision sensors (event-based cameras) capture intensity changes. Their main advantages are low latency (a few microseconds), high dynamic range (120dB), and low power consumption (5-12mW) [7], [8], [9], [10]. The high sampling rate and dynamic range of neuromorphic cameras enable the sensor to achieve a higher sensitivity and time resolution in robotic applications. The low latency of the sensory system allows the robot to feed back control signals in real time in order to prevent failures [11], [12], [13]. In addition, the low power consumption of the sensor may enable battery-powered robotic systems to operate for longer.
Computer vision has long been a key enabler of industrial robots, where it is used to guide and control the positioning of the robot to achieve a high-level task.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
As supported in the literature [12], [13], [14], [15], [16], the use of conventional frame-based cameras in robotic applications limits the maximum speed and process robustness because of shortcomings such as motion blur, low dynamic range, latency, exposure timing, poor perception in low-light conditions, and high power consumption. These shortcomings impose constraints on robot operational speeds, workspace volumes, and ambient lighting conditions, which affect the robustness and productivity of robotic manufacturing processes. However, the use of neuromorphic cameras introduces a new challenge: not enough event data is available for robotic applications. Hence, in this paper, a novel augmentation-based deep learning (DL) technique is introduced to develop predictive contact force measurement models for neuromorphic vision sensors using limited measured data. The presented results show that the developed DL models are promising tools for learning measurements from limited experimental data to make high-fidelity performance predictions. In our previous work [17], we proposed a novel vision-based tactile sensor that uses a neuromorphic vision sensor to estimate the contact force with deep learning techniques. A number of deep learning architectures and hyperparameters were studied, and the model based on ConvLSTM layers achieved the highest accuracy for contact force estimation. However, the experiments were conducted with a single object size, so this approach cannot be generalized to objects of different sizes.
In this paper, we conduct a new set of experiments with three different object sizes. In addition, new augmentation methods are proposed to synthesize experiments for an unseen object size without performing real experiments. We demonstrate that the augmentation methods improve the neural networks' accuracy without further experiments. Our approach significantly reduces the cost and time of the data collection process by creating new synthetic datasets for unseen object sizes. Both 2D (image-based) and temporal (time-domain) augmentation techniques are investigated, and a novel technique is proposed that shifts events along the time dimension to generate further synthetic samples. For evaluation purposes, all the augmentation methods are validated on the best deep learning architecture (ConvLSTM) from [17].
The main contributions of this paper are:
• Developing time-domain and image-based augmentation for the neuromorphic tactile sensor for objects with different sizes.
• Proposing a novel event-based augmentation technique, "Temporal Event Shifting", to synthesize sequences and increase the model's accuracy.
• Performing new experiments with various object sizes to validate the effectiveness of the augmentation methods using the ConvLSTM architecture proposed in [17].

A. RELATED WORK
Data augmentation techniques aim to generate synthetic data for training to improve model generalization. Augmentation methods can be divided into two main categories [18]: Model-based augmentation, and data manipulation. Model-based augmentation methods focus on training models to generate synthetic data from the real data such as Generative Adversarial Networks (GAN) introduced in [19]. Algorithmic data manipulation techniques apply fundamental operations to the data to generate realistic samples. For images, the geometric transformation of the training data such as rotation, translation, and shear has shown an improvement for classification tasks [20]. In [21], geometric translations and dropout layers are utilized to improve traffic sign recognition. The results indicate that the validation accuracy was improved by more than 5% considering rotation, translation, and shearing augmentation methods. In addition to spatial methods, other image-based augmentation techniques such as image distortion, morphological, and noise injection techniques have increased the networks' accuracy for image classification [22].
The augmentation process involves many variables that can be tuned based on the application and system characteristics. Some studies, such as [23], proposed an automatic framework for data augmentation. The proposed approaches consider both feature-space and data-space augmentation methods to generate synthetic data. To validate the augmentation methods, the models are trained multiple times to account for the random initialization of weights. From another point of view, the effectiveness of refining the labels for augmentation is investigated in [24]. The authors demonstrate that algorithmic augmentation methods, including cropping, may result in inaccurate labels for specific classes. Therefore, rules and conditions must be applied in the augmentation process by considering the samples of each class independently.
Time-series augmentation methods consider time and frequency domain features to generate synthetic data. One of the common approaches in the time-domain is shifting inputs in regard to the ground truth to introduce a random delay in the sequence. In [25], signals are shifted randomly to make the model robust against unseen signals. Moreover, the authors considered a combination of pitch shifting in the frequency domain and time warping to improve the accuracy of the model for classifying environmental sounds. Window slicing is another popular approach in time-series classification which considers a sequence of the original signal during both the training and testing process of the model [26].
GANs are a class of machine learning models in which two networks are jointly trained to synthesize data. The first network (the generator) learns to generate samples from a latent feature space, while the second network (the discriminator) judges the realism of the produced samples. Although GANs achieved impressive results in [19], their training lacks stability in practice [27]. Several studies have modified the GAN structure to improve the generated samples. For instance, a cascade CNN with pyramid (multi-scale) features is proposed in [28], which has produced high-quality realistic samples. In [29], a novel class of architecture, Deep Convolutional Generative Adversarial Networks (DCGAN), is presented to generate samples in an unsupervised manner. In addition to image generation, time-dependent GANs are designed to capture temporal features and produce time-series samples. In [30], recurrent neural networks are employed in both the generator and the discriminator to produce continuous time-series samples. Similarly, a recurrent conditional GAN is proposed in [31] and [32] with conditions in the time dimension to generate multi-dimensional time-series samples. A comprehensive review of recent time-series GANs is provided in [33]. However, training a GAN model requires a considerable amount of existing data to achieve acceptable results. The training process for a GAN is time-consuming, and the results need to be confirmed by a human. Furthermore, interpreting the representations learned by GANs and deep learning models is difficult compared to algorithmic augmentation methods [18]. Due to the lack of event-based data for grasping applications, we propose algorithmic augmentation methods to enrich the data for the contact force estimation models. Algorithmic augmentation methods target handcrafted features to produce synthetic data based on logic and observations tailored to a particular application.
Evaluation of the augmentation methods is often performed on a validation set by using the augmented data in the training process. Since deep neural networks can easily overfit the training data, performance on the validation set provides a more intuitive evaluation. For instance, algorithmic and GAN augmentation methods are used in [34] to evaluate the effectiveness of each method for a classification task on the validation set. Similarly, various augmentation methods are proposed in [35] to classify medical images. The networks' accuracy is evaluated on the validation set to analyze the effectiveness of augmentation techniques.
Even though image augmentation techniques have been studied widely in the literature, few studies have investigated event-based augmentation for any application. In [36], dropout event augmentation techniques are proposed to drop events randomly, based on the time and area of events. The authors demonstrate that such a technique leads to improved network accuracy using various event representations and datasets. Another study [37] proposed a mix of geometric augmentations including rotation, flipping, rolling, cutout, mix-up, and shear methods to augment events. This method shows a significant improvement in accuracy for both SNNs and ANNs. The event augmentation techniques can be applied directly to event streams, event-frames, and other common event representations reviewed in [7]. This paper investigates image-based and time-series augmentation methods applied to sequences of event-frames in tactile sensing applications.

II. EVENT FRAME SEQUENCE AUGMENTATION
Events captured by neuromorphic vision sensors are characterized by location (x,y), timestamp (t), and polarity (p). Similar to the frame construction in [17], event frames are constructed by the accumulation of events over a time window while preserving spatial information. The accumulation of events is performed on positive and negative polarity events separately to construct two channels of the frame. This technique has been widely used to compress event data [38], apply image-based deep learning methods [39] and be compatible with standard hardware accelerations for images and sequence of frames.
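The accumulation step described above can be sketched as follows. This is a minimal illustration, not the implementation used in [17]: the function name `events_to_frames`, the `(x, y, t, p)` column layout, the microsecond timestamps, and the channel order are all assumptions made for the example.

```python
import numpy as np

def events_to_frames(events, n_frames, window_us, height=180, width=240, t0=0):
    """Accumulate events into two-channel event frames.

    events: array of shape (N, 4) with columns (x, y, t_us, p); this layout
            is an assumption for the sketch. p > 0 marks positive polarity.
    Returns an array of shape (n_frames, height, width, 2) with per-pixel
    event counts; channel 0 holds negative and channel 1 positive events.
    """
    events = np.asarray(events)
    frames = np.zeros((n_frames, height, width, 2), dtype=np.int64)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = (events[:, 3] > 0).astype(int)
    # Index of the time window each event falls into.
    idx = ((events[:, 2] - t0) // window_us).astype(int)
    keep = (idx >= 0) & (idx < n_frames)
    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(frames, (idx[keep], y[keep], x[keep], p[keep]), 1)
    return frames
```

Events whose timestamps fall outside the requested windows are simply dropped in this sketch.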
The sensor has a resolution of 240 × 180 pixels, which covers the contact area and the background. To reduce the memory requirements of the system and the effect of the background noise, each frame is cropped to 140 × 150 pixels based on the largest contact area size. Afterwards, the frames are downscaled by half (70 × 75) by summing neighboring pixels into a single pixel. Furthermore, the two resized channels are combined into one matrix to create the event frames. For visualization purposes, the created matrix is rendered as an image using the red and green channels.
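A possible form of the crop-and-downscale step is sketched below. The crop offsets `top` and `left` are hypothetical placeholders (the text does not give the crop position), and reading 140 × 150 as height × width is an assumption; the downscaling sums each 2 × 2 neighborhood so that event counts are preserved.

```python
import numpy as np

def crop_and_downscale(frames, top=0, left=0, crop_h=140, crop_w=150):
    """Crop event frames and halve their resolution by summing 2x2 blocks.

    frames: array of shape (T, H, W, C) holding event counts.
    top, left: hypothetical crop offsets, not taken from the paper.
    Returns an array of shape (T, crop_h // 2, crop_w // 2, C).
    """
    cropped = frames[:, top:top + crop_h, left:left + crop_w, :]
    t, h, w, c = cropped.shape
    # Split each spatial axis into (blocks, 2) and sum over the block axes.
    blocks = cropped.reshape(t, h // 2, 2, w // 2, 2, c)
    return blocks.sum(axis=(2, 4))
```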
There is a trade-off between resizing the frame and maintaining the spatial information of the events. In this application, pixel-wise information is not critical for the accuracy of the overall contact force estimation. Reducing the image size decreases the model inference and training time, which is important for real-time applications. Figure 1 presents the cropping and resizing process over the two channels. After the frames are constructed, the augmentation methods are applied to generate further synthetic sequences for training the networks.

A. IMAGE-BASED AUGMENTATION
2D or image-based augmentation techniques aim to enrich the dataset to achieve better generalization and eliminate biases in the dataset. For example, if experiments are captured within a specific range of object orientations, rotation augmentation adds experiments with other object orientations to the dataset. Parallel grippers apply force on the object from both sides simultaneously, as shown in Figure 2. The object orientation remains the same throughout the grasp once the object has stabilized. Assuming that objects have the same shape, two main features vary between different objects: (i) size; (ii) contact area orientation. Both features can be augmented by affine transformations of the contact area images (event-frames).

1) ROTATION
The contact area orientation may vary across experiments. On the other hand, the object orientation remains the same in a stable grasp using a parallel gripper. Therefore, we apply the same rotation transformation to all frames of each sequence, instead of varying the transformation along the sequence. X_t(x, y, p) represents the sequence of the original frames with spatial coordinates (x, y) and polarity p at time frame t. For each experiment, the newly generated frames X'_t(x', y', p) are formulated according to Equation 1, while Equation 2 represents the rotation around the centre of the object (x_o, y_o) by an angle φ.
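Equations 1 and 2 are not reproduced in this text. A plausible form, assuming a standard planar rotation about (x_o, y_o) and using the symbols defined above, is sketched here; the exact notation of the original equations may differ.

```latex
% Assumed form of Equation 1: the rotated frame takes the value of the
% original frame at the corresponding source coordinates.
X'_t(x', y', p) = X_t(x, y, p)

% Assumed form of Equation 2: rotation by \varphi about the centre (x_o, y_o).
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} x - x_o \\ y - y_o \end{pmatrix}
+
\begin{pmatrix} x_o \\ y_o \end{pmatrix}
```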

2) RESIZING
The aim of this paper is to augment data for a grasped object whose size differs from the sizes used in the captured data. For example, the training data includes experiments for a small and a large object, while the sensor must estimate the contact force for any intermediate object size. In order to augment the images to the desired size (e.g., medium size), the original images are resized with a specific scaling ratio β. The scaling ratio is determined from the real object sizes, where β > 1 and β < 1 correspond to resizing to larger and smaller sizes respectively. We choose linear interpolation to assign values to the pixels. To preserve the same image resolution for all samples, a margin of zeros is added to maintain the image size. As the x and y dimensions are scaled with the same ratio, the resized samples preserve the aspect ratio of the object contact area.
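The resizing step can be illustrated with a small NumPy sketch. It assumes the scaling is performed about the frame centre and implemented by inverse mapping with bilinear interpolation; `rescale_frame` is a hypothetical helper, not the authors' code. Output pixels that map outside the source frame are set to zero, which produces the zero margin when β < 1.

```python
import numpy as np

def rescale_frame(frame, beta):
    """Scale a 2-D frame about its centre by beta, keeping the canvas size.

    Uses inverse mapping with bilinear interpolation (an assumed choice).
    Pixels whose source location lies outside the frame are set to zero.
    """
    frame = np.asarray(frame, dtype=float)
    h, w = frame.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates for every output pixel.
    sy = cy + (yy - cy) / beta
    sx = cx + (xx - cx) / beta
    inside = (sy >= 0) & (sy <= h - 1) & (sx >= 0) & (sx <= w - 1)
    sy = np.clip(sy, 0, h - 1)
    sx = np.clip(sx, 0, w - 1)
    y0 = np.floor(sy).astype(int)
    x0 = np.floor(sx).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy, wx = sy - y0, sx - x0
    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bottom = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return np.where(inside, top * (1 - wy) + bottom * wy, 0.0)
```

In use, the same β would be applied to every frame and both polarity channels of a sequence. Note that interpolation yields fractional values and does not preserve the total event count.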

B. NOISE
To establish a noise model, a set of experiments is recorded without any movement in the scene. The events triggered in these recordings are considered noise and are accumulated over a time window into two separate polarity channels. Finally, the noise frames are added to the frames in the original dataset to generate samples with artificially added noise.
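As a rough sketch of this step, recorded noise frames can be added element-wise to the frames of a sequence. How noise frames are paired with data frames is not specified in the text; drawing one noise frame at random per data frame is an assumption of this example, and `add_noise` is a hypothetical name.

```python
import numpy as np

def add_noise(sequence, noise_frames, rng):
    """Add recorded background-noise frames to a sequence of event frames.

    sequence:     array (T, H, W, 2) of event counts.
    noise_frames: array (K, H, W, 2) accumulated from a static scene.
    rng:          numpy.random.Generator used to pick noise frames
                  (random pairing is an assumption of this sketch).
    """
    idx = rng.integers(0, len(noise_frames), size=len(sequence))
    return sequence + noise_frames[idx]
```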

C. TIME-SERIES AUGMENTATION
In the grasping process, many factors such as the Dynamic Vision Sensor (DVS) threshold, the silicone material, sensor hysteresis, and uncertainty cause a varying delay between the applied force and the triggered events. Time-series augmentation methods aim to generate synthetic sequences by applying transformations along the time dimension.

1) FRAME SHIFTING
One of the simplest augmentation techniques in the time domain is to shift the index of the frames by a certain value (j) while preserving the ground truth. This approach helps the network to deal with a slight lag between different sequences. Since shifting the frames removes |j| frames from the input, the same number of new frames, all set to zero, are added to keep the sequence length fixed. Equation 3 presents the frame shifting process, where the new frames are denoted as X'_t and j is the shifting value. Frame shifting is applicable in both directions: (i) Left: the frames are shifted to earlier timestamps (j < 0); (ii) Right: the frames are shifted to future timestamps (j > 0).
\forall t,\quad X'_t(x, y, p) = X_{t+j}(x, y, p) \qquad (3)

2) TEMPORAL EVENT SHIFTING
Similar to frame shifting, we propose a novel approach that shifts events across the frames, called "Temporal Event Shifting" (TES). In fact, frame shifting is a specific case of temporal event shifting in which all the events are moved to the previous or next frames. The proposed method randomly selects a fraction ζ of the events (0 < ζ < 1) in each frame. These events are removed from the current frame and moved j frames forward or backward. Figure 3 demonstrates the procedure for temporal event shifting to the right while preserving the spatial information of the events.
To shift the events to the past frames, j is set to a negative value. The process is formulated as follows, where the new frame is denoted as X'_t. For all t and p, create a difference frame Z_t(x, y, p) such that:
\sum_{x,y} Z_t(x, y, p) = \zeta \cdot \sum_{x,y} X_t(x, y, p)
X'_t(x, y, p) = X_t(x, y, p) - Z_t(x, y, p) + Z_{t+j}(x, y, p) \qquad (6)
Based on this formulation, frame shifting is the special case of temporal event shifting with ζ = 1, for which Z_t = X_t.
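The two time-domain augmentations can be sketched together in NumPy. The sketch follows the indexing of the equations literally (frame t receives content from frame t + j), so the mapping between the sign of j and the "left"/"right" direction should be checked against the convention in use. The function names, the integer event-count representation, the rounding of ζ times the event count, and the handling of sequence boundaries (shifted events with no destination frame are dropped) are assumptions.

```python
import numpy as np

def frame_shift(seq, j):
    """Frame shifting: X'_t = X_{t+j}; frames without a source are zero."""
    out = np.zeros_like(seq)
    n = seq.shape[0]
    for t in range(n):
        if 0 <= t + j < n:
            out[t] = seq[t + j]
    return out

def temporal_event_shift(seq, j, zeta, rng):
    """Temporal event shifting: X'_t = X_t - Z_t + Z_{t+j}.

    seq:  integer array (T, H, W, C) of per-pixel event counts.
    zeta: fraction of events selected at random in each frame and channel.
    rng:  numpy.random.Generator controlling the random selection.
    """
    seq = np.asarray(seq)
    z = np.zeros_like(seq)
    for t in range(seq.shape[0]):
        for p in range(seq.shape[-1]):
            counts = seq[t, ..., p].ravel()
            k = int(round(zeta * int(counts.sum())))
            if k == 0:
                continue
            # One entry per event, labelled with the index of its pixel.
            pixel_of_event = np.repeat(np.arange(counts.size), counts)
            chosen = rng.choice(pixel_of_event, size=k, replace=False)
            picked = np.bincount(chosen, minlength=counts.size)
            z[t, ..., p] = picked.reshape(seq.shape[1:-1])
    # Remove the selected events and re-insert them j frames away.
    return seq - z + frame_shift(z, j)
```

Because the selection depends on the random generator, calling `temporal_event_shift` with different seeds yields different synthetic sequences for the same j and ζ, whereas `frame_shift` produces a single sequence per value of j.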

III. EXPERIMENTAL SETUP
The experimental environment is deliberately not fully controlled, in order to mimic real-world grasping applications and show the sensor performance under uncertainty. Real-life experiments are conducted on a Baxter robot equipped with a force/torque (F/T) sensor, a silicone membrane, a DVS, and 3D-printed transparent planes. The transparent silicone membrane has a Shore hardness of 50 and a depth of 8mm. Furthermore, the range of contact force is set to 0-25N, which is significantly higher than the force range in [11], [17]. Figure 2(a) presents the experimental setup for the grasping task.
Three bolts with diameters of 12, 15, and 18mm are used for the grasping process, as shown in Figure 2(b). In this paper, all objects are painted black to increase the contrast between the environment and the object's surface. Alternatively, a black silicone membrane with fixed lighting conditions can be considered, as in [40]. Fixed lighting conditions and DVS thresholds standardize the event threshold for all experiments in various environments.
This study aims to reduce the cost and time of the data collection process by investigating the impact of augmentation methods. Therefore, we assume that experiments for two object sizes are given, while the network learns to estimate the contact force for an unseen object size (e.g., medium). The main reason for choosing the medium-size object for validation and testing is to ensure correct interpolation between the smallest and the largest sizes. In practice, collecting data for two sizes (smallest and largest) is feasible, and other sizes can be augmented with the proposed method. Therefore, we choose the small and large bolts for training (48 sequences), while the medium bolt experiments are held out for evaluation (6 sequences for the validation set and 6 sequences for the test set). Furthermore, the augmented data for the desired size (medium bolt) are added to the training set to evaluate the network performance and compare the augmentation techniques. Figure 4 presents the force values recorded by the F/T sensor for the training (a) and validation (b) sets.
Two configurations are set for the gripper to grasp the object with a different applied force. The experimental setup is not fully controlled which results in a slight variation of force between experiments with the same configuration. Therefore, a slight variation of force over time is visible.

A. PREPARATION OF FRAMES
The experiments have a maximum length of 360ms. In this paper, 36 frames are constructed for each sequence by accumulating events over a 10ms window. The frames are cropped to 140 × 150 pixels, a size selected based on the largest object contact area, to reduce the noise and eliminate the background. Afterwards, the frames are resized to 70 × 75 pixels by accumulating neighboring pixels to reduce memory requirements. The resizing ratio is selected based on the maximum saturation level of each pixel over the time window. The force readings, measured by the F/T sensor, have a resolution of 2ms. The force measurements are then read every 10ms to synchronize them with the frames.
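The synchronization arithmetic can be made concrete with a short sketch: a 360ms sequence with 10ms windows gives 36 frames, and one force label per frame is obtained by taking every fifth 2ms reading. Which reading within each window is used is not stated; taking the last one is an assumption of this example, as is the helper name `labels_per_frame`.

```python
import numpy as np

def labels_per_frame(force_readings, frame_ms=10, force_ms=2, n_frames=36):
    """Subsample force readings so that there is one label per event frame.

    force_readings: 1-D sequence sampled every force_ms milliseconds.
    Takes the last reading inside each frame window (an assumed choice).
    """
    step = frame_ms // force_ms  # 5 readings per 10ms frame
    return np.asarray(force_readings)[step - 1::step][:n_frames]
```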

B. TRAINING CONFIGURATIONS
We have studied various architectures including LSTM, CNN-LSTM and Convolutional LSTM architectures in [17] comprehensively. In this paper, we validate the effectiveness of the augmentation techniques on the best-performing architecture (ConvLSTM) only to have a fair comparison between the augmentation methods without changing the network architecture.
To select the hyperparameters, we first performed experiments on the original dataset to find the best optimizer, early stopping value, and learning rate, and to ensure network convergence. Second, we train the model on each augmented dataset 10 times with different random seeds, while keeping the same hyperparameters and network architecture, to reduce the influence of randomness. Finally, we evaluate the results by considering the average error across the 10 trained models. Figure 5 presents examples of training and validation loss for different augmentation methods.
To ensure all the networks reach the stabilization point of the training and validation loss, we set the early stopping parameter to 20 based on trial and error. Therefore, the training process finishes when the validation loss has not improved for 20 consecutive epochs. The Adam optimizer is used to minimize the training loss (MSE) on the training set, while the validation loss is monitored to select the best network. All the models are trained with the same configuration to provide a fair comparison. The Keras framework is used to set the training configuration on an NVIDIA 1080 GPU. Figure 4 presents the training and validation sets of the dataset.
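The stopping rule can be illustrated with a framework-independent sketch. The helper `early_stopping_epoch` is hypothetical and only mirrors the logic described above (stop after 20 epochs without improvement and keep the best epoch); it is not the training code used in the experiments.

```python
def early_stopping_epoch(val_losses, patience=20):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch is the one whose
    weights would be kept as the selected network.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # The loss history ended before the patience ran out.
    return len(val_losses) - 1, best_epoch
```

In Keras, comparable behaviour is typically obtained with the `EarlyStopping` callback monitoring `val_loss` with `patience=20`.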

IV. RESULTS AND DISCUSSION
To evaluate the augmentation techniques, the training data size is doubled with the synthetic sequences while preserving the ground truth. Since the random initialization of weights affects the training process, the random seed is controlled for 10 runs. The final results are obtained by averaging, over the 10 runs, the lowest validation error of each run, using the same random initializations for every method. Figure 6 presents the average MSE on the validation set for the image-based augmentation methods, where the red line shows the standard deviation of the MSE.
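The aggregation over runs can be sketched as below. The helper `summarize_runs` is hypothetical, and whether the reported standard deviation is the population or the sample estimate is not stated; the population form (NumPy's default) is assumed here.

```python
import numpy as np

def summarize_runs(val_mse_per_run):
    """Average the lowest validation MSE of each run.

    val_mse_per_run: list of per-epoch validation MSE curves, one per seed.
    Returns (mean, std) of the per-run minima; std uses ddof=0 (assumed).
    """
    best = np.array([min(curve) for curve in val_mse_per_run])
    return float(best.mean()), float(best.std())
```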
The network trained without augmentation (No Augment) achieves an MSE of 7.89N with a standard deviation (STD) of 2.09N. A standard deviation of more than 25% of the mean indicates instability of the training process with respect to random initialization. Rotating the images by between 0 and 45 degrees (Rot45) provides a slight improvement in network accuracy. The best result among the geometric augmentation approaches is achieved by resizing to the desired object size. The scaling factor is set to 1.25 and 0.83 for the small and large objects respectively.
In addition, we consider background noise for further augmentation. The background noise includes both event polarities and is added to the original frames to double the training samples. The results indicate a slight improvement of 10% in MSE, while the standard deviation is comparable to that of the networks trained without augmentation. With resizing, the MSE of the network is reduced to 6.05N and the standard deviation is decreased to 1.04N, a decrease of 50%. Therefore, resizing is the most effective image-based augmentation method, which makes sense as the challenge in our experiments is to train the networks for an unseen object size.
The two time-series augmentation methods described in Section II-C are tested: Frame Shifting (FS) and the proposed Temporal Event Shifting (TES). In most of the experiments, the majority of events are fired within three frames (30ms) in the grasping and releasing phases. Therefore, our time-series augmentation considers a maximum shift of 3 frames. For the FS method, j is varied between -3 and 3 to find the most effective value for shifting the frames. Figure 7 presents the effectiveness of frame shifting augmentation with different j values. Shifting one frame to the left (FS-1) results in the lowest MSE of 5.51N, which is 30% less than the MSE achieved without augmentation. Furthermore, the STD of the errors is reduced significantly, to roughly one fifth (0.41N) of that of the networks trained on the real data only.
For the TES method, the fraction of events to be moved (ζ) is set to 0.25, 0.50, and 0.75, with the same j variations as in the FS method. Figure 8 presents the average MSE on the validation set for different values of j and ζ. Among the TES augmentation configurations, a two-frame shift to the left with ζ = 0.50 (TES-2(0.50)) results in the minimum MSE of 5.98N, with the standard deviation reduced to 0.53N compared to the results without augmentation.
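The relative improvements quoted in this section can be checked directly from the reported values. The snippet below is only a worked calculation on the numbers stated in the text; `relative_reduction` is a helper introduced for the example.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` with respect to `baseline`."""
    return 100.0 * (baseline - value) / baseline

baseline_mse, baseline_std = 7.89, 2.09  # No Augment, as reported

# Spread of the baseline relative to its mean (about 26.5%).
relative_std = 100.0 * baseline_std / baseline_mse

# MSE reductions with respect to training without augmentation.
fs1_reduction = relative_reduction(baseline_mse, 5.51)     # FS-1, about 30.2%
tes_reduction = relative_reduction(baseline_mse, 5.98)     # TES-2(0.50), about 24.2%
resize_reduction = relative_reduction(baseline_mse, 6.05)  # resizing, about 23.3%
```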
In FS-based augmentation, only one new sequence can be generated from each original sequence for a fixed j value. In TES-based augmentation, on the other hand, the random seed affects the selection of events, and as a consequence, an unlimited number of new samples can be produced for specific values of j and ζ. We ran an experiment in which 480 synthetic sequences were generated by varying the seed for the TES-2(0.50) method. The results show that increasing the number of generated samples does not improve the network performance: an average MSE of 6.25N with a standard deviation of 0.82N is achieved. The main reason for this phenomenon is that the ground truth remains the same despite the significant variation in the input.
Most of the augmentation techniques in the time domain improve the networks' performance. The main factors that affect the events along the time dimension are the F/T sensor hysteresis, the non-linear behavior of the silicone membrane, uncertainty, and vibrations. These factors are inevitable in real-world applications which show the benefit of the augmentation methods in the time domain.
A typical grasping task includes three phases: (i) the grasping phase, in which the contact force increases to its maximum level (the first 5 frames); (ii) the holding phase, which shows a slight variation of force from the 6th frame to the 30th frame; (iii) the releasing phase, in which the force decreases continuously to zero (the last 5 frames). Figure 9 presents the average estimated force (blue) and the ground truth (red) for two examples of the validation set over 10 runs. The top row (a,b) shows the average of the force predictions for training without augmentation, while the middle row (c,d) presents the average estimated force using the FS-1 method. The bottom row (e,f) shows the average estimated force and the ground truth using the TES-2(0.50) method.
The results indicate that both frame shifting and temporal event shifting augmentation reduce the standard deviation of the predictions in all three phases. In fact, the impact of random initialization is decreased by augmenting the training data. In Figure 9 (b) and (d), a clear improvement of the estimated force is visible in the majority of the vibration phase. Even though frame shifting results in a lower MSE and standard deviation, the temporal event shifting method captures the maximum contact force (at the 5th timestamp) more accurately in most cases.
In order to investigate the impact of the augmentation methods on all the measurements, 12 predictions from each of the 10 models are considered for the grasping, holding, and releasing phases. The final results include 4320 points (12 sequences × 10 models × 36 frames), which are shown in Figure 10. The black line presents the contact force measured by the F/T sensor. The estimated force is presented in blue, red, and green for the grasping, holding, and releasing phases respectively. Figure 10 compares the estimated force using the FS-1 and TES-2(0.50) augmentation techniques.
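A per-phase error analysis of these points could be organised as in the sketch below. The phase boundaries follow the description of the grasping task (first 5 frames, frames 6 to 30, remaining frames); assigning all frames after the 30th to the releasing phase is an assumption made so that the three phases cover all 36 frames, and `phase_mse` is a hypothetical helper.

```python
import numpy as np

# Frame ranges per phase (0-indexed slices over the 36 frames).
PHASES = {
    "grasping": slice(0, 5),
    "holding": slice(5, 30),
    "releasing": slice(30, 36),  # assumed to take all remaining frames
}

def phase_mse(pred, truth):
    """Mean squared error per grasp phase.

    pred, truth: arrays of shape (n_models, n_sequences, 36).
    """
    return {
        name: float(np.mean((pred[..., s] - truth[..., s]) ** 2))
        for name, s in PHASES.items()
    }
```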
As observed in Figure 10, the FS-1 and TES-2(0.50) augmentations improve the force estimation in the holding phase. Both augmentation methods shift events to earlier frames to create synthetic sequences. The main reason for this improvement is that the number of triggered events increases significantly after a certain amount of force is applied. Therefore, shifting the events to the left allows the network to relate more events to the contact force in the early frames. Furthermore, the silicone membrane has a non-linear deformation that absorbs a fraction of the contact force, particularly in the transition phases. The force absorption coupled with the F/T sensor hysteresis introduces a variable delay between the triggered events and the contact force.
The image-based and time-domain augmentation methods synthesize the training data from different perspectives. Therefore, a combination of both methods provides both spatial and time-domain variations in the generated samples. Since the best accuracy is achieved by resizing and FS-1, these two methods are combined to generate a new set of synthetic samples. There are two ways to combine the two methods: (i) perform each augmentation method independently to generate synthetic sequences; (ii) hybridise both augmentation methods on the same samples to generate a set of synthetic samples. The results indicate that independent augmentation of each sample achieves better accuracy than a simultaneous combination of methods. The independent sample generation method reduces the average MSE of the networks to 5.71N with a standard deviation of 1.06N, which is slightly higher than with the FS-1 method. The hybrid augmentation method results in a high MSE of 7.20N with a standard deviation of 1.22N, a significantly higher error compared to the FS-1 method. Figure 11 presents the average MSE of the proposed augmentation methods, where the standard deviation is highlighted as a red line.
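The difference between the two combination strategies can be sketched with generic augmentation callables. The function names are hypothetical, and keeping the original sequences in the returned training set is an assumption of this example.

```python
def combine_independent(dataset, aug_a, aug_b):
    """Apply each augmentation separately to every original sample."""
    return list(dataset) + [aug_a(s) for s in dataset] + [aug_b(s) for s in dataset]

def combine_hybrid(dataset, aug_a, aug_b):
    """Apply both augmentations, one after the other, to the same sample."""
    return list(dataset) + [aug_b(aug_a(s)) for s in dataset]
```

Here `aug_a` could stand for resizing and `aug_b` for FS-1; the independent variant yields two synthetic sequences per original sequence, while the hybrid variant yields one sequence that carries both transformations.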
In image-based augmentation techniques, resizing the object to a desired size results in the best accuracy. Since the network learns the relationship between the applied force and triggered events based on the contact area, resizing the training data simulates the experiments for the new size of an object.
A noticeable delay was observed in the releasing phase, where the network always responds faster than the F/T sensor. In fact, the response time of the silicone membrane has a significant impact on the delay between the triggered events and the contact force. For example, in [11], we demonstrated that objects of the same shape but different elasticity generate a different number of events, which can be used to classify the objects' material. Therefore, the augmentation methods in the time domain improve the network accuracy remarkably, with FS-1 resulting in the lowest average MSE.

V. CONCLUSION
This paper proposed a novel event-based method to generate synthetic data for vision-based force estimation in the spatial and temporal domains. The experiments are performed on three object sizes, where the smallest and the largest objects are used for training and the medium-size object is used for testing. A novel augmentation technique, temporal event shifting, is proposed to generalize the network to unseen experiments. We demonstrated that algorithmic augmentation methods improve the network accuracy significantly without performing new experiments.