Energy-Efficient Object Tracking Using Adaptive ROI Subsampling and Deep Reinforcement Learning

Recent innovations in ROI camera systems have opened up the avenue for exploring energy optimization techniques like adaptive subsampling. Generally speaking, image frame capture and read-out demand high power consumption. ROI camera systems make it possible to exploit the dependence of energy consumption on spatiotemporal pixel readout to optimize the power efficiency of the image sensor. To this end, we develop a reinforcement learning (RL) based adaptive subsampling framework which predicts ROI trajectories and reconfigures the image sensor on-the-fly for improved power efficiency of the image sensing pipeline. In our proposed framework, a pre-trained convolutional neural network (CNN) extracts rich visual features from incoming frames and a long short-term memory (LSTM) network predicts the region of interest (ROI) and subsampling pattern for the consecutive image frame. Based on the application and the difficulty level of the object motion trajectory, the user can utilize either the predicted ROI or the coarse subsampling pattern to switch off pixels during sequential frame capture, thus saving energy. We have validated our proposed method by adapting existing trackers to the adaptive subsampling framework and evaluating them as competing baselines. As a proof of concept, our method outperforms the baselines and achieves an average AUC score of 0.5090 on three benchmarking datasets. We also characterize the energy-accuracy tradeoff of our method vs. the baselines and show that our approach is best suited for applications that demand both high visual tracking precision and low power consumption. On the TB100 dataset, our method achieves the highest AUC score of 0.5113 out of all the competing algorithms and requires a medium-level power consumption of approximately 4 W as per a generic energy model and an energy consumption of 1.9 mJ as per a mobile system energy model. Although other baselines are shown to have better performance in terms of power consumption, they are ill-suited for applications that require considerable tracking precision, making our method the ideal candidate in terms of the power-accuracy tradeoff.


I. INTRODUCTION
Object tracking is one of the most ubiquitous computer vision applications, with a rich history in robotics, surveillance, and autonomous vehicles. Further, deep learning-based neural networks have accelerated progress in object tracking to state-of-the-art performance. Due to this success, recent research has turned to embedded or energy-efficient object tracking, where system constraints on power and latency are critical for extended deployment in the wild.
Image sensors are one of the most power-hungry components in embedded vision platforms, particularly in the case of applications like continuous object tracking. The analog readout circuitry of image sensors can consume 50-70% of the total energy in most modern mobile system designs [1], [2]. Furthermore, always-on vision cameras that duty cycle their sensing to save battery life are used in several applications [3], [4], [5]. Finally, surveillance cameras employ uninterrupted data capture and hence need to be energy efficient for prolonging battery life.
To address inefficient energy consumption in the processing of real-time video data, one important mechanism for image sensors is adaptive subsampling: the selective readout of regions of interest (ROIs) during sequential frame capture, with the remaining pixels in the image switched off. Commercially available cameras that can read out selective ROIs offer benefits such as reduced image quantization, higher effective bandwidth, and improved energy efficiency. Sensor energy consumption scales with the spatiotemporal resolution of the streamed images, i.e. lower frame rates and image resolutions consume less energy [6].
In many applications, ROI-specific information is sufficient. For instance, only the ROI describing the target object is necessary for tasks like surveillance and autonomous driving. The objective of an adaptive subsampling method is to determine this correct ROI. The concept of adaptive subsampling is somewhat similar to predictive object tracking [7], in that an object's future position is inferred from previous images in both frameworks. While there has been a lot of research on object tracking, predictive tracking with adaptive subsampling is less studied in the literature. Previous methods such as [8], [9], [10], and [11] utilize deep neural networks such as RNNs and LSTMs to perform regression-based location prediction for predictive object tracking. However, all these methods rely on fully sampled frames to perform predictive tracking and do not robustly track objects when images are subsampled based on ROI predictions.
In this paper, we develop a robust predictive tracking method for the adaptive subsampling framework. In our pipeline, the image sensor reads out only the pixels comprising the ROI predicted by the tracking algorithm in the previous step, thus saving energy. We show that existing state-of-the-art trackers such as DiMP [12] and ATOM [13] are ill-suited for this new energy optimization technique. These methods, originally designed to operate on fully sampled frames, adaptively generate the search region for the next frame by roughly extending the tracked bounding box in the current frame. This dramatically degrades their tracking performance when operating on subsampled images. Thus, we consider the objective of jointly performing adaptive subsampling and predictive object tracking and specifically design a method that facilitates both robust tracking and optimization of sensor energy consumption. This allows us to trade off energy savings via adaptive subsampling against tracking performance. Most existing ROI selection techniques are performed offline and without the image sensor in the loop. In contrast, our method incorporates programmable sensors in the loop, which are reconfigured by the network-determined ROIs, and we design an algorithmic pipeline tailored to such programmable sensors.
While tracking by detection is not a recent technique, the novelty of our method lies in the underlying energy optimization framework and our proposed neural network-based solution that enables preemptive ROI subsampling via predictive tracking. We utilize the tiny YOLO network for feature extraction [14], followed by an LSTM network [15] for adaptive subsampling prediction and tracking. Novel to our design is the use of reinforcement learning to train the network, specifically the REINFORCE algorithm [11], [16], also known as Monte Carlo policy gradient, to converge to the optimal tracking and subsampling policy. Our contributions in this paper are as follows:
• We develop a policy gradient method for learning image subsampling patterns which aid in ROI prediction. Our method can be integrated with ROI-capable cameras to improve image sensor energy efficiency.
• We propose a loss function based on target location and image subsampling which captures the dissimilarities between network predictions and corresponding target labels.
• We also show the efficacy of our network in the context of energy optimization by reporting potential energy savings and computational efficiency at test time. The proposed technique is evaluated on a variety of datasets and against conventional state-of-the-art object trackers. Note that we are developing it as a proof of concept to show that detection-based trackers can be utilized to maintain energy efficiency by tracking with subsampling. We also compare against a number of baseline algorithms coupled with Kalman filtering to endow them with predictive capability. Our method outperforms both state-of-the-art and baselines in terms of AUC and achieves significant energy savings owing to the adaptive subsampling component.
A preliminary version of this work has appeared in one of the authors' doctoral dissertations [17]. In this paper, we have expanded on the ideas first presented in the thesis and provided extensive experiments to evaluate the proposed methodology. In the thesis, one of the proposed approaches for adaptive subsampling is the policy gradient method. It is shown how an LSTM is trained using the REINFORCE algorithm and how it outperforms other existing baselines and achieves higher AUC scores across datasets. The effectiveness of the proposed method is also evaluated in terms of detection fidelity during long intervals between keyframes (frames that are not subsampled), and a generic power model demonstrates how the proposed approach strikes the right balance between tracking accuracy and energy savings. This paper, on the other hand, contains more extensive experiments and analyses. Unlike the thesis, here we have analyzed the tracking performance of our method and the baselines on video sequences featuring occlusion and have dedicated an entire subsection to the subject. We have also deployed a second energy model that better reflects real-world data and have reported the corresponding energy consumption numbers. Thus, we have performed a more comprehensive power study in this work than what is shown in the thesis. In this paper, we have also shown the performance comparison of pretrained baseline models with models trained specifically on subsampled image frames, an important ablation study that is missing from the thesis. We have also provided additional qualitative and visual results comparing the adaptive subsampling performance of our method vs. the baselines on input videos of mild, medium, and hard difficulty levels.
The rest of the paper has been organized as follows: We have surveyed the related work in the area in Section II; the proposed method, network architectures, and algorithm details and specifics have been provided in Section III; datasets, baselines, and implementation details have been given in Section IV and all experimental results and accompanying explanations have been provided in Section V. Relevant visual results and explanations are given in Section VI and results for training baseline methods on subsampled frames are given in Section VII. Finally, the conclusions and general discussions are in Section VIII.
II. RELATED WORK
A. OBJECT TRACKING
Several recent works also utilize regression-based target tracking methods [8], [44]. Dual-regression-based frameworks [45] and dual-margin models [46] have been shown to optimize both accuracy and robustness. Learning dual-level deep representations has been shown to be effective for infrared thermal tracking as well [47]. However, some of these methods are unable to utilize long-term temporal information efficiently. Recently, there has been an increase in the utilization of recurrent neural networks for target tracking [9], [48], [49]. In [48], the authors utilize the regression capability of Long Short Term Memory (LSTM) networks to predict the target location. A similar model was proposed in [49], wherein the authors utilized a recurrent neural network to predict the top-left and bottom-right corners of a bounding box. However, this method utilizes localization error as the final cost function, unlike [9], wherein a classification error averaged over all the frames is used. The object tracking methods closest to ours include [10], [11] and [50], in which CNNs are used for feature extraction and RNNs with RL training techniques are used for visual target tracking. In [7], the authors introduce a new benchmark for predictive visual tracking that accounts for both performance and latency. While several authors have also proposed RL-based video tracking solutions [51], [52], [53], [54], [55], [56], [57], none of them, to the best of our knowledge, have developed and analyzed a technique that enables adaptive subsampling for image sensor energy efficiency in embedded systems.
The recent boom in autonomous and mobile platforms has created a need to make the object detection and tracking pipeline more energy efficient. In [58], the authors present an adaptive methodology wherein the embedded camera state duration is determined based on the speed of the tracked object. However, this method does not work well with strongly shadowed videos. In [59], the authors propose a software framework titled MARLIN, which enables content-driven real-time tracking by switching between deep learning and lightweight techniques. However, the method fails in instances when the neural network-based tracker is not triggered in time.

B. ROI ADAPTIVE SUBSAMPLING
In the majority of embedded and mobile platforms, image sensing is one of the major sources of energy expenditure, which leads to inefficient battery usage. In [6], the authors show how image sensor energy expenditure scales with pixel resolution and frame rate. To address this problem, we develop a content-driven adaptive subsampling strategy wherein the algorithm learns to read out specific regions of interest (ROIs) to save energy via spatial subsampling. The notion of ROIs has also been leveraged to accomplish image compression [62], [63]. Another work that closely resonates with ours is the objectness-based subsampling mechanism [64]. In [64], the algorithm detects ROIs by employing the objectness feature. However, the reference frame subsampling mask is reused for ROI detection in consecutive frames. This fails to account for changes in the appearance of the object, which leads to erroneous tracking at least until the next reference frame comes in. In a similar vein, the authors of [65] have developed an adaptive subsampling strategy which utilizes a YOLO network for object detection and a Kalman filter for ROI prediction.
In [66], a predictive tracking-based adaptive subsampling method was proposed with a Kalman filter playing the role of the ROI predictor. Although our objective is very similar to theirs, we have investigated and implemented a method that has greater predictive power: we employ an LSTM network in lieu of the Kalman filter to make our future location predictions. An extension of that work [60] also showed how the ECO tracker coupled with the Kalman filter could be leveraged for greater ROI prediction accuracy. However, the LSTM network benefits from past temporal information encoded in its hidden units and is able to anticipate future trajectories with a greater degree of accuracy than the ECO tracker plus Kalman filter-based predictive algorithm (see Section V). In Table 1, we summarize the baseline methods that we have implemented for evaluating our proposed technique, with a brief description of the methodology for each technique along with its highlights and limitations.

FIGURE 1. The LSTM learns the optimal sensor mask-generation strategy based on joint bounding box and coarse-grained subsampling pattern prediction using reinforcement learning and uses the mask to obtain corresponding subsampled frames.

III. METHOD
The proposed method features a dual architecture for performing predictive object tracking and subsampling. Subsampling refers to the selection of only the region of interest (ROI) in the image based on the prediction made by the network. The algorithm exploits a policy gradient [11] strategy with a cost based on target object location and subsampling pattern (explained in subsection B) to guide the model training. The final objective is to obtain a trained network with predictive ROI capabilities.
The concept of keyframing is a critical design choice in our subsampling and tracking pipeline. This idea has been used extensively in video compression algorithms [67], [68] to reduce storage and memory bandwidth requirements. Keyframing means that visual feature extraction on fully sampled frames occurs only at specific intervals, termed keyframing intervals. The underlying premise is that for the non-keyframes, the image sensor samples specific pixels based on the ROI predicted by the network for maximal energy efficiency. On the other hand, we feed fully sampled keyframes through the pipeline at specific intervals to optimize tracking performance. Given that we fully sample frames only at specified intervals, the sacrifice we make in terms of energy savings is negligible compared to the increase in robustness. In our pipeline, the keyframing interval is user-defined, and we have conducted extensive experiments to demonstrate the effect of different keyframing intervals on tracking performance. In Section V (subsection Keyframing), we present the tracking performance of our architecture for different keyframing intervals. The key idea is that for applications where tracking accuracy is more important than energy savings, the keyframing interval would be shorter, i.e. image frames will be fully sampled more frequently. On the contrary, for low-power applications where energy efficiency is of paramount importance, the user would specify a longer keyframing interval, ensuring higher energy savings.
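To make the update/prediction alternation concrete, the following is a minimal sketch of the inference-time control loop. The helper names (readout_roi, feature_fn, predictor_fn) are illustrative placeholders standing in for the programmable sensor readout, the Tiny YOLO backbone, and the LSTM predictor; they are assumptions made for exposition, not our released implementation.

```python
import numpy as np

def readout_roi(frame, mask):
    """Zero out pixels the sensor would not read; frame is HxWxC, mask is a boolean HxW array."""
    return frame * mask[..., None]

def track_sequence(frames, feature_fn, predictor_fn, keyframe_interval=11):
    """Run the update/prediction keyframing loop.

    feature_fn:   frame -> feature vector (e.g., a Tiny YOLO backbone)
    predictor_fn: (features, hidden) -> ((bbox, mask), hidden)  (e.g., the LSTM)
    """
    hidden, mask = None, None
    outputs = []
    for t, frame in enumerate(frames):
        if t % keyframe_interval == 0 or mask is None:
            observed = frame                      # update phase: full-frame readout
        else:
            observed = readout_roi(frame, mask)   # prediction phase: ROI-only readout
        (bbox, mask), hidden = predictor_fn(feature_fn(observed), hidden)
        outputs.append((bbox, mask))
    return outputs
```

A shorter keyframe_interval biases the loop toward accuracy; a longer one biases it toward energy savings, mirroring the user-defined tradeoff described above.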

A. NETWORK ARCHITECTURE
The dual architecture comprises (1) a pre-trained Tiny YOLO network to extract feature representations from fully sampled and subsampled frames [14], and (2) an LSTM layer for ROI and subsampling prediction. This dual architecture is inspired by [10], wherein the authors use an observation network for extracting feature information followed by a recurrent network for location regression. However, to the best of our knowledge, ours is the first work that focuses on predicting subsampling masks via regression-based object detection. The two kinds of subsampling masks we use for training are (1) an ROI-based subsampling mask and (2) a coarse-grained mask created by dividing each video frame into static 7 × 7-pixel patches and turning each patch on or off depending on whether a portion of the ground truth object lies in it (coarse-grained ROI). Thus, our proposed method differs from [10] both in terms of the use case and the design of the reward function used in the underlying RL algorithm. In [10], the reward is constructed from the bounding box regressions. We, however, propose a more robust reward function by leveraging both the ROI and the coarse-grained subsampling mask (see the Ablation Study of Subsampling Loss subsection in Section V).
In Figure 1, we show the network architecture and inference pipeline for a trained network. For the incoming frames, visual feature extraction is conducted using Tiny YOLO, a state-of-the-art real-time object detector that has been used extensively for tracking applications. Tiny YOLO was selected based on our future goal of implementing the proposed pipeline in hardware. We will use Tiny YOLO/YOLO interchangeably throughout the paper. The extracted features are regressed using an LSTM to predict the bounding box and a coarse-grained ROI (subsampling matrix) for the next frame. At the next time step, a non-keyframe is subsampled according to the location information/coarse-grained ROI predicted by the LSTM, leading to pixels outside of the ROI being switched off. The Tiny YOLO extracts features from this subsampled image, which are then passed through the LSTM along with the updated hidden state in which the past bounding box and coarse ROI information remain embedded. The prediction phase persists until the next keyframe comes in, at which point the LSTM network once again gets access to features extracted from a fully sampled image frame.

During training, the LSTM network receives feature information extracted by the YOLO as well as the ground truth object location and coarse subsampling mask whenever the incoming frame is a keyframe. For all non-keyframes, only the feature vectors are provided as input to the LSTM. The temporal information pertaining to the object trajectory remains encoded in the LSTM's hidden units, which aids in converging to an optimal tracking policy via reinforcement learning. Thus, the network learns to anticipate the spatiotemporal motion of the object and make ROI and subsampling mask predictions.
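For concreteness, a minimal PyTorch sketch of such a dual architecture is shown below. The feature dimension, hidden size, and head layout are illustrative assumptions; the actual Tiny YOLO backbone and layer configuration are not reproduced here.

```python
import torch
import torch.nn as nn

class ROIPredictor(nn.Module):
    """Sketch of the dual architecture: a (pre-trained) feature backbone followed by
    an LSTM with two regression heads, one for the bounding box and one for the
    64x64 coarse subsampling mask. Dimensions are assumptions for illustration;
    the backbone is assumed to return one flat feat_dim vector per frame."""

    def __init__(self, backbone, feat_dim=1024, hidden_dim=512, grid_cells=4096):
        super().__init__()
        self.backbone = backbone                             # e.g., Tiny YOLO features
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.bbox_head = nn.Linear(hidden_dim, 4)            # (x, y, w, h)
        self.mask_head = nn.Linear(hidden_dim, grid_cells)   # coarse subsampling pattern

    def forward(self, frames, hidden=None):
        # frames: (B, T, C, H, W) -> per-frame features: (B, T, feat_dim)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        bbox = self.bbox_head(out)                           # mean of the bbox Gaussian
        mask = torch.sigmoid(self.mask_head(out))            # mean of the mask Gaussian
        return bbox, mask, hidden
```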

B. NETWORK TRAINING VIA REINFORCEMENT LEARNING
The proposed architecture leverages the YOLO-based deep learning pipeline for extracting high-quality visual features and employs an LSTM model to predict the object location. However, since our application requires sequential subsampled frame capture for long durations at inference time, we train our network using the REINFORCE algorithm to improve long-run tracking and subsampling performance. We formulate the problem of joint tracking and adaptive subsampling as a reinforcement learning task and utilize the REINFORCE algorithm [11], [16] to perform the prediction step. This REINFORCE algorithm-based training procedure is depicted in Figure 2. As presented in Figure 2, the feature vectors of fully sampled frames, i.e. keyframes, are concatenated with the ground-truth location and subsampling mask for the consecutive frame. Unlike keyframes, non-keyframe feature vectors are not accompanied by any ground truth bounding box or subsampling mask, which forces the network to estimate the target accurately given only the initial location information.
In reinforcement learning terms, the LSTM plays the role of the RL agent, the environment is the visual world sensed through the image sensor, and the state is the current image frame (which may or may not be subsampled). The policy function is defined by the network weights $r = \{r_o, r_{rc}\}$, where $r_o$ are the YOLO network parameters and $r_{rc}$ are the LSTM parameters. The control set $U$ represents all possible bounding boxes and subsampling masks that the agent can choose to minimize the cost. Thus, the network's goal is to learn a policy function $\mu(u_k \mid z_{1:k}; r)$, characterized by the network parameters $r$, to determine a control $u_k \in U$ (i.e. a bounding box and subsampling mask pair). This function depends on the past state-control trajectories up to time step $k$, where $u_k \in U$ denotes the predicted bounding box and subsampling mask and $z_k$ denotes the image frame/state at time step $k$. The LSTM encodes the past information $z_{1:k}$ in its hidden states parameterized by $r_{rc}$. Hence, the policy function relies on past interactions between the agent and the environment.
The policy function induces a probability distribution over all possible state-control trajectories $Z$, and the optimization is restricted to a parameterized subset $\tilde{P}_z \subset P_z$ of distributions, written as $p(z; r)$, where $P_z$ denotes the set of probability distributions over $Z$. Hence, the final optimization problem that needs to be solved is as follows [69]:
$$\min_{r} \; \mathbb{E}_{z \sim p(z;r)}\!\left[F(Z)\right], \qquad F(Z) = \sum_{k=1}^{K} g_k, \tag{1}$$
where $K$ is the end time step, $g_k$ is the cost at time step $k$, and $F(Z)$ represents the cumulative cost up to the last time step. As shown in Equation (1), the primary optimization problem is the expectation of the cost-generating function with respect to the control-space probability distribution. For our problem of joint object tracking and adaptive subsampling, we have formulated the following per-step cost function:
$$g_k = \left(1 - \cos\!\left(u^{sm}_k, gt^{sm}_k\right)\right) + \operatorname{mean}\!\left(\left| u^{bb}_k - gt^{bb}_k \right|\right) + \max\!\left(\left| u^{bb}_k - gt^{bb}_k \right|\right),$$
where $u^{bb}_k = \{x, y, w, h\}$ denotes the bounding box predicted by the algorithm, sampled from a multivariate Gaussian distribution with mean $\phi^{loc}_k$ (the output of the tracking network) and fixed variance; $gt^{bb}_k$ is the ground truth bounding box corresponding to state $k$; $u^{sm}_k$ is the subsampling prediction sampled from a multivariate Gaussian distribution with mean $\phi^{sm}_k$ and fixed variance; and $gt^{sm}_k$ is the corresponding ground truth subsampling pattern. The $\left(1 - \cos(u^{sm}_k, gt^{sm}_k)\right)$ term guides the network to learn accurate subsampling patterns based on the ground truth object location; the ground truth subsampling mask provides supervision for coarse-grained subsampling so that the network zooms in on the frame region containing the object. The $\operatorname{mean}(| u^{bb}_k - gt^{bb}_k |) + \max(| u^{bb}_k - gt^{bb}_k |)$ term guides the network to focus on the zoomed-in region of the frame and localize the object with finer precision.
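A minimal PyTorch-style sketch of this per-step cost, under the reconstruction above, could look as follows (tensor shapes and batching are assumptions):

```python
import torch
import torch.nn.functional as F

def step_cost(u_bb, gt_bb, u_sm, gt_sm):
    """Per-step cost g_k as reconstructed above: cosine dissimilarity on the sampled
    coarse mask plus mean and max absolute bounding-box error.
    u_bb, gt_bb: (..., 4); u_sm, gt_sm: (..., 4096)."""
    bbox_err = (u_bb - gt_bb).abs()
    mask_term = 1.0 - F.cosine_similarity(u_sm, gt_sm, dim=-1)
    return mask_term + bbox_err.mean(dim=-1) + bbox_err.amax(dim=-1)
```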
Since the expectation requires integrating over a probability distribution defined by the inaccessible policy, we assume that $p(z; r)$ is a discrete probability distribution. Furthermore, since the control space $U$ is defined by a probability distribution, we can construct an episodic algorithm. This results in the following approximation of the gradient [69]:
$$\nabla_{r}\,\mathbb{E}_{p(z;r)}\!\left[F(Z)\right] \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{k=1}^{K} \nabla_{r} \log \mu\!\left(u^{m}_{k} \mid z^{m}_{1:k}; r\right)\left(F(Z^{m}) - b_k\right),$$
where $M$ is the number of sampled episodes. The baseline $b_k$, taken as the expectation of the cost function, is used to compensate for the high variance exhibited by the episodic outputs.
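The corresponding single-episode REINFORCE surrogate loss can be sketched as below. It reuses the step_cost sketch above; the Gaussian variances and the running-mean baseline are illustrative assumptions rather than the exact training configuration.

```python
import torch
from torch.distributions import Normal

def reinforce_loss(phi_loc, phi_sm, gt_bb, gt_sm, sigma_bb=1.0, sigma_sm=1.0):
    """REINFORCE surrogate loss for one batch of episodes.
    phi_loc: (B, T, 4) bbox means; phi_sm: (B, T, 4096) mask means.
    Controls are sampled from fixed-variance Gaussians centred on the network
    outputs, scored with step_cost, and log-probabilities are weighted by the
    baseline-subtracted cost-to-go (assumed formulation)."""
    bb_dist, sm_dist = Normal(phi_loc, sigma_bb), Normal(phi_sm, sigma_sm)
    u_bb, u_sm = bb_dist.sample(), sm_dist.sample()            # u_k ~ mu(. | z_1:k; r)
    costs = step_cost(u_bb, gt_bb, u_sm, gt_sm)                # g_k per step, shape (B, T)
    # Cost-to-go per step (flip-cumsum-flip along the time axis).
    returns = torch.flip(torch.cumsum(torch.flip(costs, [-1]), -1), [-1])
    baseline = returns.mean().detach()                         # b_k: simple expected-cost baseline
    logp = bb_dist.log_prob(u_bb).sum(-1) + sm_dist.log_prob(u_sm).sum(-1)
    # Minimizing this surrogate lowers the probability of high-cost controls.
    return (logp * (returns - baseline).detach()).mean()
```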

C. TESTING
At test time, the LSTM receives feature vectors of fully sampled frames from the YOLO only at the beginning of each keyframing interval. Once the prediction phase is activated, the YOLO network no longer receives fully sampled image frames and instead receives non-keyframes, i.e. subsampled frames (Figure 1). The LSTM is able to track the target and produce the subsampling mask for consecutive frames partly due to the features the YOLO network extracts from the partially subsampled frames and partly due to the object trajectory the network has learned to track implicitly. As it transpires, our network is capable of making reliable predictions without receiving fully sampled image features at every time step.

IV. IMPLEMENTATION
A. DATASETS
We have evaluated our algorithm on three different datasets: 1) TB-100 [70], 2) LaSOT [71], and 3) TrackingNet [72]. The video sequences comprising these datasets feature a wide variety of objects in motion, including people, animals, and vehicles. We randomly split the TB100 dataset into training and testing sets as in [10]. However, instead of creating the split within video sequences as in [10], we chose a random set of 81 videos for training. We also use 30 and 100 randomly sampled videos for training from the LaSOT and TrackingNet datasets, respectively. The reasoning for these splits, and for using TB-100 for training, is that the primary goal of our proposed method is not to compare methods on conventional object tracking but rather to analyze performance on subsampled video sequences and hence the energy-accuracy trade-off. For a fair comparison, we used the same splits for the baseline methods. We use only a subset of the main datasets during training in order to account for the complexity of the LSTM network on top of the computational complexity of the REINFORCE algorithm. We were further motivated to down-select the training videos to investigate the generalizability of our RL-trained LSTM network. Even with sparse training data, our model manages to anticipate the state-control trajectories remarkably well at test time. Note that there is no overlap between training and test data in our experiments. Although there are instances where multiple objects are present per frame, we use the ground truth labels to perform single object tracking. Furthermore, to develop the ground truth subsampling masks needed for stronger supervision, we resized all videos to 448 × 448 and gridded each frame into 7 × 7-pixel patches, forming a total of 4096 patches (a 64 × 64 grid) per image frame. Each patch is assigned a binary label of 1 if a portion of the ground truth object lies in the patch and 0 otherwise. The resulting 4096-D vector is used as the ground truth subsampling mask during training.
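As an illustration, the ground-truth coarse mask for a frame could be generated as in the following sketch; the rasterization details are our assumption, with only the 448 × 448 frame size and 7 × 7-pixel patches taken from the text.

```python
import numpy as np

def coarse_mask(bbox, frame_size=448, patch=7):
    """Build the 4096-D ground-truth subsampling mask: the 448x448 frame is divided
    into 7x7-pixel patches (a 64x64 grid) and a patch is labelled 1 if any part of
    the ground-truth box overlaps it, 0 otherwise."""
    x, y, w, h = bbox                        # ground-truth box in pixel coordinates
    g = frame_size // patch                  # 64 patches per side
    mask = np.zeros((g, g), dtype=np.float32)
    c0, c1 = int(x // patch), int(np.ceil((x + w) / patch))
    r0, r1 = int(y // patch), int(np.ceil((y + h) / patch))
    mask[max(r0, 0):min(r1, g), max(c0, 0):min(c1, g)] = 1.0
    return mask.reshape(-1)                  # flattened 4096-D label vector
```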

B. BASELINES
To validate the effectiveness of the proposed tracker, we compare the performance of our network against two types of baselines: (1) predictive trackers and (2) state-of-the-art tracking architectures deployed at test time on adaptively subsampled videos.

FIGURE 3. Object tracking and subsampling with our method vs. the baselines. We select the same frame in a video sequence and display the sensor-mask-generated subsampled image obtained with our method as well as the baselines. The frame generated by our method is visually closest to the ground truth and indicates the best mAP and highest energy savings compared to the other methods.
For predictive trackers, we utilize baseline systems with a similar structure to ours, wherein we couple an object detector with a Kalman filter [73] as shown in [65], but we introduce variation by swapping out the tiny YOLO with various detectors in the pipeline. This approach was also developed by [7] as a type of new baseline for visual tracking algorithms. The various object detectors we utilize include the YOLO architecture [65], a Kernelized Correlation Filter (KCF) [24], a Distractor-Aware Tracker (DAT) [61], and Efficient Convolution Operators for Tracking (ECO) [74].
For state-of-the-art object trackers, we use two recent methods: Accurate tracking by overlap maximization (ATOM) [13] and Learning discriminative model prediction for tracking (DiMP) [12]. ATOM determines the target using high-level information during offline learning, and then a dedicated classification component is trained online to maximize the discerning capabilities of the network while dealing with distractors in the input scene [13]. DiMP is an end-to-end tracking architecture wherein both foreground and background information are leveraged for target prediction [12]. While both of these trackers are state-of-the-art, we show in our experimental results that they are not well-suited for adaptive subsampling. For evaluating these methods against our proposed algorithm, we adapt their tracking methodology to the adaptive subsampling-based predictive tracking framework. To accomplish this, we have restructured the tracking setup for ATOM and DiMP such that they receive full image frames only when the image sensor is signaled to read out a keyframe. In all other instances, the trackers receive subsampled frames wherein only the ROI pixels are read out. In the case of ATOM and DiMP, if frame i is a keyframe, the network outputs a bounding box around the object of interest. The ROI comprising this bounding box is then adopted as the ROI prediction for generating the sensor mask and for subsampling the next incoming frame i + 1. Thus, from frame i + 1 until the next keyframe, the networks predict the ROIs required for adaptively subsampling incoming frames for energy optimization. We refer to the ATOM and DiMP-based subsampling methods as Adaptive ATOM and Adaptive DiMP, respectively. In this paper, we use ATOM and Adaptive ATOM interchangeably, with both terms referring to Adaptive ATOM. The same applies to DiMP and Adaptive DiMP.
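The following sketch illustrates how a conventional tracker can be wrapped for this evaluation protocol; the tracker.update interface and the optional margin are assumptions made for illustration, not the exact adaptation used for ATOM and DiMP.

```python
import numpy as np

def adaptive_subsampled_tracking(frames, tracker, keyframe_interval=11, margin=0):
    """Wrap a conventional tracker for the adaptive subsampling evaluation: the
    tracker sees the full frame only on keyframes; on all other frames, pixels
    outside the previously predicted bounding box (optionally padded by `margin`)
    are zeroed before tracking. `tracker.update(frame) -> (x, y, w, h)` is an
    assumed interface."""
    bbox, results = None, []
    for t, frame in enumerate(frames):
        if t % keyframe_interval == 0 or bbox is None:
            observed = frame                              # keyframe: full readout
        else:
            x, y, w, h = [int(v) for v in bbox]
            observed = np.zeros_like(frame)               # all pixels off by default
            ys, ye = max(y - margin, 0), y + h + margin
            xs, xe = max(x - margin, 0), x + w + margin
            observed[ys:ye, xs:xe] = frame[ys:ye, xs:xe]  # read out only the predicted ROI
        bbox = tracker.update(observed)                   # predicted ROI for the next frame
        results.append(bbox)
    return results
```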

C. IMPLEMENTATION DETAILS
We implement the proposed algorithm using the PyTorch framework. A learning rate of λ = 0.0001 was selected after careful empirical analysis during the initial training phase, and the Adam optimizer was used. To train our network, we chose a keyframing interval of 11, based on the observation that a duration of 11 frames typically does not entail a large change in the motion trajectory of the object while still maintaining a good energy savings/accuracy tradeoff. We have also conducted an ablation study demonstrating the effect of increasing the keyframing interval for the proposed method as well as the baselines.
On average, the network needs at least three days of training on an Nvidia GeForce RTX 2080 Ti graphics card in order to converge on a dataset. However, inference can be performed in real time. The per-frame computation time on GPU using the proposed method during testing is approximately 3.4 ms, while the execution times for ATOM and DiMP are approximately 1.3 ms and 1.2 ms, respectively. The mean execution times for the DAT+KF, KCF+KF, ECO+KF, and YOLO+KF baselines, averaged over the three datasets, are 0.8, 41.2, 303.9, and 29.3 ms on CPU, respectively. Although our network incurs a slightly longer per-frame computation, its superior tracking performance justifies the increase in latency.

V. EXPERIMENTAL RESULTS
A. TRACKING PERFORMANCE
Figure 4 illustrates the effectiveness of the proposed tracker in terms of achieving high object tracking precision (mean average precision, mAP) while maintaining satisfactory energy efficiency in terms of image resolution. Table 2 shows that our method outperforms all the baselines and achieves AUC scores of 0.5113, 0.4979, and 0.5177 on the TB100, LaSOT, and TrackingNet datasets, respectively. Figure 4 also demonstrates that our method maintains better or comparable tracking fidelity relative to the baselines over a range of IoU thresholds. Furthermore, we achieve high energy savings in terms of the ratio of pixels turned off per frame on all three datasets, as shown in Table 3. Note that our method is outperformed in terms of energy savings by most of the other methods at a keyframing interval of 11. This can be attributed to the fact that the other methods are not trained for adaptive subsampling and are therefore prone to missing target objects and switching off pixels inside the region of interest, which yields higher energy savings at the cost of degraded tracking performance. Hence, maintaining the energy-accuracy trade-off is the key, and it is well achieved by our method.

Figure 3 visualizes the tracking and adaptive subsampling performed by our method as well as the baselines. Note that our network outperforms all baselines and manages to correctly identify the target location. Although the baseline methods, especially Adaptive ATOM and Adaptive DiMP, perform remarkably well when the input is a keyframe, these methods start breaking down on non-keyframes, especially towards the far end of the keyframing interval. The frame depicted in Figure 3 is a non-keyframe, and the baseline methods, being unsuited to adaptive subsampling, cannot compete with our method, which has been specifically trained using the subsampling loss for ROI prediction in an adaptive subsampling setup. In Figure 3, the "Ground-truth" sub-figure (Figure 3(b)) shows the ground-truth subsampling mask, while the "Fully Sampled" sub-figure (Figure 3(a)) depicts the original scene with the target object enclosed in a red bounding box. The goal of the adaptive subsampling methods is to predict a subsampling mask as close to the ground truth (Figure 3(b)) as possible. For some of the methods, e.g. Adaptive ATOM and Adaptive DiMP, the ROI predictions on the example frame are significantly off, and it appears that both methods have lost the target. The ATOM tracker proves to be ill-suited for the adaptive subsampling framework, as its underlying feature detector relies on scene details and target-specific information, which are not completely available in subsampled image frames.
Similarly, the DiMP tracker exploits both foreground and background information for achieving state-of-the-art tracking performance. In the absence of background scene details in an adaptive subsampling framework, the method fails to achieve comparable accuracy and proves to be ill-suited for the proposed energy optimization technique. On the other hand, our method has learned implicit subsampling mask patterns based on both visual and encoded temporal features and has learned an optimal policy for anticipating target trajectories over time. Thus, our method has managed to keep track of the target where the Adaptive ATOM and Adaptive DiMP methods have failed. For more information, please watch the tracking videos provided in the supplemental.

B. ABLATION STUDY OF SUBSAMPLING LOSS
We formulate our loss as a function of both the bounding box prediction loss and the subsampling loss. Analyzing the network performance on the test data by training the network both with and without the subsampling loss, we observe the advantage of implementing the proposed loss function. After training the network for roughly the same number of epochs, we obtain a test mAP (IoU@0.5) of 0.3388 on the TB100 dataset without the subsampling prediction loss and a test mAP of 0.5262 with it. This can be attributed to the fact that, in the instances when the network is not able to converge to the correct ROI, the subsampling information may still capture the correct location information. Essentially, the grid-wise division of the image frame helps encode the correct localization of the moving object even for erroneous bounding box predictions.

TABLE 3. Energy consumption results for adaptive subsampling with a keyframing interval of 11: our method vs. the baselines. KF refers to the Kalman filter; the baselines with the KF have been adopted from [60] as baselines for our proposed method. We report the power consumption numbers in W as per a generic image sensing power consumption model from [6], [60]. We also report energy consumption in mJ as per a conventional mobile system energy model from [75].

FIGURE 5. Results for the keyframing experiment. We have swept the keyframing interval from 15 to 240 for our method and all of the baselines on the TB100 dataset and reported the mAP (IoU@0.5). It is evident that our method is able to maintain the tracking fidelity for a longer duration.

C. KEYFRAMING
We refer to the fully sampled images that the network receives as keyframes; the rest are referred to as subsampled frames. After the first fully sampled frame is processed through the Tiny YOLO network + LSTM (update phase), we get the object location and subsampling mask prediction. For the consecutive non-keyframes, the LSTM accepts a feature vector extracted from a subsampled frame, wherein the scene content inside the previously predicted bounding box and subsampling mask is read out from the sensor as the new input, with all other pixels switched off (prediction phase). A user-defined interval triggers the next update phase, i.e. the reception of the next keyframe. Note that a longer interval will result in higher energy savings at the expense of tracking precision. The interval can be adapted as per the fidelity needs of the application.

FIGURE 6. Comparison of test AUC scores (keyframing interval = 11) with ATOM and DiMP trained using different strategies (1. when training data is fully sampled; 2. when training data is subsampled; 3. when subsampled training data is used to finetune the pretrained networks).

Figure 5 depicts the effect of increasing the interval between the update stage and the prediction stage of the tracking algorithm. Comparing the effect of increasing the keyframing interval, it is evident that our method is able to maintain significant precision even at higher keyframing intervals. On the contrary, techniques like the KCF+KF and especially the ECO+KF, which perform noticeably well at lower intervals, cannot sustain that same performance at higher intervals. Further, the ATOM and DiMP methods start breaking down as the keyframing interval increases: the frequency of fully sampled frame information decreases with increasing keyframing interval, and these methods do not work well when there is a dearth of target-specific information. This proves the potential efficacy of our technique for applications where both tracking accuracy and energy efficiency are of prime importance. Note that the keyframing experiment was conducted on the TB100 test data. We attribute our network's improved performance, even with a prolonged prediction phase, to its ability to zero in on the object's region (even coarsely for longer keyframing intervals). Figure 5 shows the effect of keyframing on mAP (IoU@0.5) for intervals of up to 240 frames. For most computer vision-based applications, an effective frame rate of 60 FPS affords satisfactory latency. Therefore, it is promising that our network retains its performance for the most part even at the 60-frame interval. This implies that it will not require the next fully sampled frame for an entire second when integrated with a real camera system. That said, an extremely high degree of object motion over long durations would degrade our network's performance similarly to the baselines.

FIGURE 7. (a) Scatter plot demonstrating the accuracy vs. energy savings tradeoff as per the generic energy model from [6], [60] for the TB100 dataset (with a keyframing interval of 11). Our method provides the highest AUC score with satisfactory energy savings. (b) Scatter plot depicting the tradeoff between energy consumption and accuracy for the TB100 dataset (with a keyframing interval of 11). Our method provides the highest AUC score with satisfactorily low energy consumption as per a conventional mobile system energy model [75].

D. POWER ANALYSIS
Adaptive subsampling offers an energy-efficient solution whereby pixels are switched off outside of the ROI for nonkeyframes, thus saving energy. To estimate the energy savings we achieve with our RL tracking algorithm, we characterize the energy requirements of several CMOS image sensors based on analysis from [6] and [60]. Using the ROIs generated from the proposed algorithm, we assert that the image sensor can skip certain columns during the frame readout and read only the pixels from the predicted regions. Since fewer pixels would be read out, it would result in substantial power savings while sensing. We model power for sensors B1, B2 and B3 from [6] with resolution 3264 × 2448, 2592 × 1944 and 752 × 480 respectively.
The model equations show that the average power consumption is approximately proportional to the image resolution [6]:
$$P_{avg} = R\left(\alpha_1 f + c_2\right)\left(T_{exp} + \frac{N}{f}\right), \qquad f = \left(\frac{c_2\, N}{\alpha_1\, T_{exp}}\right)^{1/2},$$
where $R$ represents the frame rate (fixed at 30 fps), $T_{exp}$ is the exposure time (fixed at 0.05 ms), $N$ represents the frame resolution in pixels, $c_2$ denotes the static power consumption (fixed per sensor: B1: 159.0, B2: 93.0, and B3: 13.1), $\alpha_1$ is a sensor intrinsic independent of resolution (fixed per sensor: B1: 4.0E-06, B2: 8.2E-07, and B3: 3.35E-06), and $f$ represents the power-optimal clock frequency, which depends on the resolution. The power consumption in milliwatts is presented in Table 3. We see lower absolute savings for sensors B2 and B3 since they have lower resolutions. We fixed the parameter $T_{exp}$ to 0.05 ms, which is typical for outdoor settings, and the frame rate $R$ to 30 fps.
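Under this reconstruction, the average power for a given number of read-out pixels could be estimated as in the sketch below (B1 parameters shown; the formula follows the reconstructed model and the absolute values and units should be treated as illustrative):

```python
import math

def sensor_power(num_pixels, frame_rate=30.0, t_exp=0.05e-3,
                 alpha1=4.0e-6, c2=159.0):
    """Average sensor power under the generic model reconstructed above.
    num_pixels is the effective resolution N (reduced under subsampling);
    defaults correspond to the quoted B1 parameters."""
    f = math.sqrt(c2 * num_pixels / (alpha1 * t_exp))   # power-optimal clock frequency
    return frame_rate * (alpha1 * f + c2) * (t_exp + num_pixels / f)
```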
In Table 3, we also report the energy consumption numbers for the different techniques being evaluated, based on the mobile system energy consumption model proposed in [75]. As per that model, the per-pixel energy consumption of the image sensor, the SoC-DRAM communication unit, and storage, and the energy per MAC of computation, are 595, 2800, 677, and 4.6 pJ, respectively. We estimate the total energy consumption of the imaging unit based on these numbers and the ratio of pixels switched off for each competing adaptive subsampling technique. Our method maintains comparable power savings while ensuring high subsampling accuracy.

FIGURE 8. Results for a relatively easy case. Our network seems to be able to anticipate what the object size is going to be for most of the image frames throughout the video. As a result, it preemptively selects a larger bounding box than necessary towards the beginning. This is an example of when our network strikes the right balance between accuracy and energy savings.
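As a rough illustration of how these per-pixel figures can be combined, consider the following sketch; the way the individual terms are summed per frame is our assumption, not a prescription from [75]:

```python
def imaging_energy_mJ(total_pixels, pixels_on_ratio, macs=0):
    """Rough per-frame energy estimate using the per-pixel costs quoted from [75]:
    595 pJ (sensor), 2800 pJ (SoC-DRAM traffic), and 677 pJ (storage) per pixel
    read out, plus 4.6 pJ per MAC of downstream computation."""
    pixels_read = total_pixels * pixels_on_ratio
    per_pixel_pj = 595 + 2800 + 677
    energy_pj = pixels_read * per_pixel_pj + macs * 4.6
    return energy_pj * 1e-9                  # picojoules -> millijoules
```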
We show the power-accuracy tradeoff of the competing methods in Figure 7, where accuracy is given by the AUC score on the TB100 dataset, power consumption is given in watts, and energy consumption is reported in mJ. Figure 7(a) depicts the power-accuracy tradeoff as per the generic energy model proposed in [6] and [60], and Figure 7(b) demonstrates the energy vs. accuracy results in terms of the mobile system-specific energy model provided in [75].
As is apparent from the tradeoff plots, other methods rank higher in terms of energy savings but at the expense of tracking performance. On the other hand, our network, although slightly more power-hungry, offers superior tracking performance as it has learned the implicit subsampling patterns during training. Therefore, in terms of the energy-accuracy tradeoff, our method strikes the right balance and sustains good tracking accuracy with reasonable energy savings.

E. OCCLUSION
Table 4 depicts the performance of our method with respect to the baselines when there is occlusion in the test data. It is evident from the table that our model fares better than most of the baselines. This may be attributed to the feature extractor in our pipeline, which prioritizes target capture over energy savings. On the occasions when the target gets occluded, the network has a predilection for generating larger bounding boxes enclosing the ROI instead of switching all pixels off. This sets it apart from the baselines, which have a tendency to zero out all pixels if the target object is lost. A more accurate appearance model characterizing the object of interest would better assist the LSTM in learning to track occluded objects with tighter bounding boxes.

VI. VISUAL RESULTS
Figures 8, 9, and 10 help visualize the joint adaptive subsampling and tracking performance of our proposed algorithm vs. the baselines. We select three videos of varying difficulty levels in order to demonstrate the effectiveness of the different algorithms. Figure 8 is a relatively easy case where there is a distinctive object of interest in the scene and a fairly uncluttered background. Our network seems to be able to predict how the object will manifest itself throughout the video sequence and thus settles on a bounding box size that captures the entire object for most of the image frames. On the contrary, the baselines do not fare as well for a number of frames in this video sequence. The fact that the target appears at different angles, owing to variations in motion, results in the baselines losing track of the object at a number of time steps. Our network does not lose the object of interest for the given frames because it makes a more conservative choice with regard to the bounding box size, thus prioritizing tracking performance at the expense of energy savings. Figure 9 demonstrates a more difficult case where the scene features low-light conditions with multiple people appearing in the foreground. Our network identifies the object corresponding to the ground truth label in the two frames depicted in Figure 9. However, it tends to draw larger bounding boxes enclosing the ROI. In direct contrast to this are the DAT+KF and KCF+KF baselines, which get sidetracked after the first few frames and keep favoring energy savings over target capture, failing to correctly predict the target location. Figure 10 depicts a scene where a person is moving the object of interest, in this case a coupon, around on top of a table. As is evident from the results provided for the two image frames, our network is initially confused as to what the object of interest is supposed to be but eventually manages to zero in on the coupon. The insight here is that, once the person in the video starts moving the coupon around, the network senses the motion of the object and is thus prompted to keep tracking the moving object. On the contrary, the baseline Kalman filter prediction fails to identify the ROI for the two frames depicted, resulting in poor accuracy as well as high readout energy consumption. The state-of-the-art networks, ATOM and DiMP, also fail to sustain their tracking performance under the adaptive subsampling setup and end up losing the object in the last frame shown.

FIGURE 9. Adaptive subsampling plus tracking results for a more challenging case. The scene features low-light conditions and multiple objects. For the two frames presented, our network zeros in on the object of interest, albeit with a larger bounding box than necessary to ensure it doesn't lose the object through the course of the video sequence.

FIGURE 10. Results obtained for a video sequence of considerable difficulty. Our network has trouble detecting the ROI for the first few frames but is finally able to converge to the object of interest.
Three example videos have been chosen for demonstration, and the corresponding tracking videos produced by the various methods have been provided in the supplementary material. Note that the videos may seem to flash in some cases; this is caused by the abrupt jump in bounding box location when the trackers receive the keyframes (fully sampled frames) and are able to better locate the ROI in the frame.

VII. PERFORMANCE COMPARISON WITH TRAINING OBJECT DETECTION ALGORITHMS ON SUBSAMPLED FRAMES
Deploying ATOM and DiMP in an adaptive subsampling setup raises the question as to whether they have been acclimated to subsampled frames in the first place. To answer this question, we train both ATOM and DiMP on subsampled images. Furthermore, we also train a version of ATOM and DiMP wherein we fine-tune the pretrained models using subsampled images. Unsurprisingly, the new networks underperform compared to the original versions, which had only received fully sampled frames as input data. This can be attributed to the fact that ATOM requires offline learning of high-level information associated with the target objects. Training it on subsampled images degraded the network performance, seeing as subsampled frames, by their very nature, conceal significant background information. Furthermore, we compared these state-of-the-art trackers with the basic YOLO detector trained on subsampled images. The YOLO detector was able to achieve an AUC score of only 0.018 when trained from scratch, which is significantly worse than the AUC score obtained when trained on fully sampled images, as shown in Table 5.
Similarly, DiMP was introduced as a work that tackled the problem of target-background discriminability. Siamese networks operate with target feature templates and do not take background information into account, and thus lack the power to discriminate the target object from the background. To counteract this problem, DiMP leverages both foreground and background information for more precise target prediction. Therefore, fine-tuning DiMP on subsampled images resulted in a dramatic degradation of tracking performance, seeing as subsampled frames suppress background pixels and, by association, background feature information. The tracker performances relevant to this discussion are shown in Figure 6.

VIII. DISCUSSION
In this paper, we have developed an RL-driven tracking method targeted toward energy-aware sensors. We have adopted predictive subsampling for generating sensor masks for ROI selection. In this way, tracking is performed while avoiding unwarranted power consumption. A pre-trained tiny YOLO network is deployed for feature extraction and an LSTM is trained as an RL agent. A policy gradient method achieves promising results and our network demonstrates robust tracking while subsampling as compared to the baselines. It also achieves a satisfactory reduction in power consumption.
In the future, we plan to employ superior scene understanding models in our pipeline to further enrich the visual information used to train the LSTM. This will enable the network to better handle more complex videos featuring occlusion and fast-moving targets. We also plan to investigate how to extend our RL method for multi-object tracking. We believe this work will pave the way for future research in the domain of reconfigurable image cameras, and the development of energy-efficient vision algorithms for these sensors.