Non-anchor-based vehicle detection for traffic surveillance using bounding ellipses

Cameras for traffic surveillance are usually pole-mounted and produce images that reflect a bird's-eye view. Vehicles in such images generally assume an elliptical shape. A bounding box for such vehicles usually includes a large empty space when the vehicle orientation is not parallel to the edges of the box. To circumvent this problem, the present study applied bounding ellipses to a non-anchor-based, single-shot detection model (CenterNet). Since this model does not depend on anchor boxes, the non-max suppression (NMS) step that requires computing the intersection over union (IOU) between predicted bounding boxes is unnecessary during inference. SpotNet, which extends the CenterNet model by adding a segmentation head, was also tested with bounding ellipses. Two other anchor-based, single-shot detection models (YOLO4 and SSD) were chosen as references for comparison. Model performance was compared on a local dataset that was doubly annotated with bounding boxes and ellipses. The performance of the two models with bounding ellipses exceeded that of the reference models with bounding boxes. When the backbone of the ellipse models was pretrained on an open dataset (UA-DETRAC), the performance was further enhanced. The data augmentation schemes developed for YOLO4 also improved the performance of the proposed models. As a result, the best mAP score of a CenterNet with bounding ellipses exceeded 0.9.


Introduction
The success of traffic operation and maintenance depends on whether area-wide traffic surveillance can be secured without failure. Much effort has been invested in space-based approaches to obtain such a level of integrity for traffic surveillance. In this context, the most conventional way to measure traffic volumes and speeds utilizes an inductive loop detector embedded in the road surface [1]-[5]. This approach, however, does not guarantee error-free detection and entails large maintenance costs. There are special detectors that utilize an ultrasonic beam [6], [7] or piezo-electric sensors [8], [9] that can overcome the handicaps of loop detectors, but these have failed to achieve an acceptable level of market penetration. Meanwhile, traffic surveillance methods are converging on the use of computer-vision schemes, as deep-learning technologies have improved the performance of vehicle detection.
Computer-vision technology had been widely adopted in traffic surveillance even before recent advances in deep-learning models. The most popular technology prior to deep learning was the background subtraction method, wherein vehicles are recognized from a silhouette image created by subtracting the pixel values of the image of interest from those of a background image [10]-[12]. The success of the method depends on the accurate creation of a background image, and the most prevalent way to accomplish that has been to synthesize a virtual image using consecutive video frames taken at the same site. Each pixel adopts the average (or mode) of the corresponding pixel values over the video frames. A virtual background image, however, cannot remain constant under varying conditions of lighting, weather, and camera positions and angles. There have been many attempts to overcome these problems [10]-[15], but no definitive solution has been found. The optical-flow method is another popular computer-vision technology used to detect vehicles and their speeds [16], [17]. This method derives gradients of pixel values to trace the movement of vehicles between consecutive video frames in a mathematical manner. The optical-flow method was relatively successful in recognizing moving vehicles, and thus in measuring traffic parameters, but its popularity has diminished with the emergence of deep-learning technologies.
In the early 2010s there was a great leap in object detection owing to the advance of deep-learning technologies. Automated vehicle identification (AVI) models for traffic surveillance are rapidly converging on the adoption of deep-learning technologies [18]-[22]. Details of this trend are described in the next section along with reviews of related work. The present study suggests a novel methodology to improve the performance of a deep-learning model that detects vehicles in video images taken for traffic surveillance. The key to this methodology is to represent vehicle images with bounding ellipses rather than conventional bounding boxes. It should be noted that this measure might not work for a general object detector that recognizes objects of various shapes. The present study focuses only on detecting vehicles for traffic surveillance, where vehicles take the shape of an ellipse since cameras for traffic surveillance are pole-mounted and provide images from an aerial view. Video frames for traffic surveillance include moving vehicles, and the moving direction of vehicles generally does not match either edge of the video frame. For this reason, bounding boxes are likely to include large empty spaces that are not occupied by vehicles. This redundancy is eliminated in the proposed approach by using bounding ellipses around vehicles. A CenterNet detection model was employed to test the performance of bounding ellipses [23], since the model requires neither anchor boxes nor a non-max suppression (NMS) process for inference. The latter process would entail long computation times when used together with bounding ellipses. SpotNet, which adds a segmentation head to the CenterNet, was also tested with bounding ellipses. Two other state-of-the-art object detection models (YOLO4 and SSD) were mobilized as references to confirm the superiority of the proposed approach.
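The redundancy described above can be quantified. For a vehicle footprint of length l and width w rotated by an angle θ, the tight axis-aligned bounding box measures (l·|cos θ| + w·|sin θ|) by (l·|sin θ| + w·|cos θ|), so the empty area grows rapidly as the heading departs from the frame edges. A minimal sketch (the car dimensions are illustrative assumptions):

```python
import math

def box_overhead(l, w, theta_deg):
    """Ratio of the axis-aligned bounding-box area to the vehicle's
    own rectangular footprint (l * w) when rotated by theta."""
    t = math.radians(theta_deg)
    bw = l * abs(math.cos(t)) + w * abs(math.sin(t))  # box width
    bh = l * abs(math.sin(t)) + w * abs(math.cos(t))  # box height
    return (bw * bh) / (l * w)

# A 4.5 m x 1.8 m car: the box is tight at 0 deg but balloons at 45 deg.
print(round(box_overhead(4.5, 1.8, 0), 2))    # 1.0: no wasted area
print(round(box_overhead(4.5, 1.8, 45), 2))   # ~2.45: over half the box is empty
```

A bounding ellipse stays tight to the vehicle at any heading, which is the motivation developed in the rest of the paper.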
The next section introduces cutting-edge technologies for deep-learning-based object detection and explains how state-of-the-art methods have been applied to traffic surveillance. The third section describes the architectures of the CenterNet and SpotNet detection models and accounts for how the models were revised to accommodate bounding ellipses. The fourth section explains how data were prepared and labeled for training and testing the models; several methods of data augmentation are introduced in the same section. The fifth section compares results based on the mean average precision (mAP), a globally accepted performance measure for object detection. The last section draws overall conclusions and provides suggestions for further studies to enhance the detection performance.

Related work
The mainstream approach to traffic surveillance is rapidly converging on deep-learning-based computer-vision technologies [18], [20]-[22]. In computer-vision studies, detecting objects in an image has long been regarded as a difficult task. The performance of deep-learning-based object detection models has, however, already surpassed human ability owing to the ever-growing advancement of deep-learning technologies. State-of-the-art detection algorithms can be categorized into two groups. Models that belong to the first category separate the region-proposal task from the subsequent classification module. In the initial stage of developing these models, all potential regions in an image that might include an object were determined in a rule-based manner, and a learning model classified objects for the proposed regions [24]. Both tasks were later integrated into a single framework, and region-proposal models are also trained on data to distinguish the foreground from the background [25], [26]. Faster-RCNN is a two-stage model with region proposals that has shown the best detection performance but is handicapped by a relatively long computation time for inference.
The second category includes one-stage detection models developed to speed up inference at the expense of some detection accuracy. One-stage detectors simultaneously conduct both the localization and classification tasks in an end-to-end manner. The YOLO series is the most popular family of one-stage models [27]-[29]. A YOLO model reduces the detection time by using a grid to divide the input image and assigning several anchor boxes to each grid cell for detection. An anchor box is a pre-defined bounding box whose location and shape are adjusted during learning. Early versions of YOLO did not outperform two-stage models in accuracy, but the latest version (YOLO4) has recorded equivalent, or even better, performance by reinforcing the model architecture and adopting diverse data augmentation schemes for training [30].
A single-shot multi-box detector (SSD) is another successful one-stage model [31]. The SSD also depends on anchor boxes to detect objects. The difference from a YOLO model is that an SSD can separately detect objects in an image at different scales. That is, several intermediate feature maps chosen from a deep neural network pipeline, each with a different resolution, are used to separately detect objects of different sizes, whereas a YOLO model uses only the last feature map for detection. This is why the title includes the word "multi-box". The RetinaNet approach constitutes another axis of the one-stage detection models [32].
RetinaNet first adopted a focal loss to mitigate the extreme class imbalance between pixels that contain an object and those that do not, by assigning different weights to each pixel's loss according to the presence or absence of an object.
All the one- and two-stage models introduced above depend on anchor boxes, which create complexity owing to their large number. The use of anchor boxes is also accompanied by the burden of determining many hyper-parameters, such as the number, size, and shape of the anchor boxes. Some researchers have therefore developed one-stage detection models that are free from anchor boxes. The CornerNet model uses a novel concept of key-points without employing anchor boxes [33]. With this approach, the two corner points of a bounding box are directly predicted based on focal losses. This scheme removes the necessity of an NMS process after inference. The CornerNet model, however, requires the additional task of matching each upper-left corner to its corresponding lower-right corner to constitute bounding boxes, which entails an exhaustive amount of computation time. Some researchers overcame this complication by developing a robust key-point-based object detector (CenterNet) [23]. A CenterNet uses only a single key-point (the center point) to recognize an object. The model has two distinct advantages over existing one-stage detectors. First, the CenterNet does not use anchor boxes. Second, there is no need to implement an NMS step for the final inference, which would repeatedly compute the intersection over union (IOU) between predicted bounding boxes. As an extension of CenterNet, some researchers developed the SpotNet model by adding a head for semantic segmentation to its architecture [34]. When training the model, the segmentation head is fed with a silhouette derived by a background subtraction method using consecutive video frames.
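The NMS-free inference of CenterNet is commonly implemented by keeping only heatmap cells that are local maxima of their 3 x 3 neighbourhood (the original implementation uses a max-pooling layer for this); a plain-NumPy sketch of the idea, with illustrative names and values:

```python
import numpy as np

def extract_peaks(heatmap, threshold=0.3):
    """NMS-free decoding: a cell is a detection if it is a local maximum
    of its 3x3 neighbourhood and exceeds the confidence threshold.
    No IOU between candidate boxes is ever computed."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 3, x:x + 3]   # 3x3 neighbourhood of (y, x)
            if heatmap[y, x] >= threshold and heatmap[y, x] == window.max():
                peaks.append((y, x, heatmap[y, x]))
    return peaks

hm = np.zeros((8, 8))
hm[2, 3] = 0.9   # one strong vehicle centre
hm[2, 4] = 0.5   # suppressed: not a 3x3 local maximum
hm[6, 6] = 0.4   # a second, weaker centre
print(extract_peaks(hm))  # only the two true peaks survive
```

This replaces the quadratic pairwise IOU comparisons of box-based NMS with a single local-maximum test per cell.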
Among the two-stage detectors introduced above, Faster-RCNN recorded the best performance and has been used as a vehicle detector for traffic surveillance [35]-[37]. As a detector, however, the RCNN requires so much computation time that it cannot be used in real-time applications. Measuring vehicle speeds and traffic volumes in the field requires tracking each vehicle at a very short interval. Thus, the YOLO series has been widely adopted in studies of traffic surveillance [38]-[42]. The performance of YOLO models ranges between 0.6 and 0.8 when measured by the mAP score. CenterNet and its extension SpotNet have also been adopted in studies of vehicle detection using the UA-DETRAC dataset [20], [34], [43], [44]. These two key-point-based detection models have recorded mAP scores exceeding 0.8, higher than those of any other detector.
In the present study, we chose the latter two detection approaches (CenterNet and SpotNet) to confirm the advantage of bounding ellipses, since they require neither the computation of IOUs nor a regression of the corner points of bounding boxes. The focal loss of the RetinaNet approach was applied to both models. The two models were also trained and tested with bounding boxes for comparison. The remaining one-stage detection approaches (YOLO4 and SSD) with bounding boxes were used as references against which to compare the key-point-based approaches that used bounding ellipses.

CenterNet: a model representing an object as a point
In the present study, we adopted a key-point-based detection approach (CenterNet) to verify the utility of replacing bounding boxes with ellipses. In this section, the architecture and loss function of the CenterNet approach are introduced, and the modifications needed to use bounding ellipses are addressed. CenterNet can employ several different backbones, such as ResNet-18, ResNet-101, DLA-34, and Hourglass-104. The present study adopted an Hourglass-104 because it outperformed the other backbones in an earlier study [23]. Consequently, after going through the backbone, the size of the input image was reduced to a smaller dimension that would be tractable in the subsequent detection process.
There was a modification in the present study when assigning ground-truth cell information to heatmaps.
The original CenterNet fills the cells around the center point of a vehicle with a Gaussian density of a single variance, which draws a circle regardless of the shape of the actual bounding ellipse. The present study, on the other hand, fills the cells of a heatmap so that the orientation and shape of the ground-truth bounding ellipse are reflected. Fig. 2 displays the three labels for a ground-truth bounding ellipse, which correspond to the lengths of the first and second axes and the orientation of the first axis.

Fig. 2 Generating ground-truth heatmaps
How to create a ground-truth heatmap based on the three labels is simple in theory. A multivariate Gaussian density function with a variance-covariance matrix can draw a region in a heatmap that corresponds to a vehicle, but a variance-covariance matrix is not easy to derive and handle. Instead, there is an easier two-step way to draw a ground-truth heatmap based on the three labels. An uncorrelated Gaussian density function assigns values to pixels within the range of the long and short half-axes (l1/2, l2/2) around the center point, and then the region is rotated by the orientation angle (θ). Open libraries such as OpenCV, Pillow, and Matplotlib can be used to carry out this two-step procedure. Eq. (1) guarantees that the cell value that corresponds to a center point has the maximum ground-truth value (= 1). The maximum value can be lowered when label smoothing [40] or CutMix augmentation [41] is applied to the training data.
In Eq. (1), Y′xyc is the value at position (x, y) of a ground-truth heatmap for category c before rotation, and (c_x, c_y) is the center point in the heatmap, which can be computed using Eq. (2). Cell values are assigned according to a Gaussian distribution, so they gradually diminish with distance from the center point and become zero at the bounding ellipse. The next step is to rotate the region around the center point by the labeled orientation angle. Yxyc denotes the cell value of the ground-truth heatmap after rotation. Making ground-truth heatmaps more consistent with the labels of bounding ellipses is a contribution of the present study compared with the original CenterNet.
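The procedure can be sketched as follows. Instead of drawing an axis-aligned Gaussian and rotating the image, this sketch applies the mathematically equivalent operation of rotating each pixel's offset into the ellipse frame; the decay rate of the density (sigma_frac) is an assumption, since the text only requires the value to reach zero at the ellipse boundary:

```python
import numpy as np

def ellipse_heatmap(h, w, cx, cy, l1, l2, theta_deg, sigma_frac=3.0):
    """Ground-truth heatmap for one bounding ellipse.
    (cx, cy): centre cell; l1, l2: full lengths of the first and second
    axes; theta_deg: orientation of the first axis.  Each pixel offset
    is rotated into the ellipse frame, then an uncorrelated Gaussian
    is evaluated; cells outside the ellipse are set to zero."""
    t = np.radians(theta_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    u = dx * np.cos(t) + dy * np.sin(t)     # offset along the first axis
    v = -dx * np.sin(t) + dy * np.cos(t)    # offset along the second axis
    s1, s2 = (l1 / 2) / sigma_frac, (l2 / 2) / sigma_frac
    hm = np.exp(-0.5 * ((u / s1) ** 2 + (v / s2) ** 2))
    hm[(u / (l1 / 2)) ** 2 + (v / (l2 / 2)) ** 2 > 1.0] = 0.0
    return hm

hm = ellipse_heatmap(64, 64, 30, 20, 40, 16, 30)
print(hm[20, 30])  # the centre cell holds the maximum value, 1.0
```

As required by Eq. (1), the density peaks at 1 at the center cell and decays to zero at the ellipse boundary, with the decay oriented along the labeled axes.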
We devised a robust annotation tool that makes drawing bounding ellipses as simple as drawing bounding boxes. Fig. 3 intuitively shows the advantages of adopting bounding ellipses rather than bounding boxes. In most cases, a bounding box is less efficient than a bounding ellipse, since it may encompass large empty spaces. In addition, the width and height labels of a bounding box cannot account for the actual size of a vehicle, whereas the labels of a bounding ellipse are consistent with the vehicle's actual size. Moreover, a false intersection area is generated when vehicles are bounded with boxes even though the two vehicles do not overlap; no false intersections are generated when using bounding ellipses. These merits motivated us to replace bounding boxes with bounding ellipses. However, it should be noted that using bounding ellipses is effective only when adopting a key-point-based object detector wherein no IOU computation is required. Applying bounding ellipses to an anchor-based detector would increase the computational complexity.
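The false-intersection claim is easy to verify numerically: two parallel vehicles in adjacent lanes at a 45-degree heading can have disjoint ellipses while their axis-aligned boxes still overlap. The geometry below is a hypothetical configuration chosen for illustration:

```python
import numpy as np

def ellipse_mask(h, w, cx, cy, a, b, theta_deg):
    """Boolean mask of pixels inside an ellipse with half-axes a, b."""
    t = np.radians(theta_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    u = dx * np.cos(t) + dy * np.sin(t)
    v = -dx * np.sin(t) + dy * np.cos(t)
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

def aabb(mask):
    """Axis-aligned bounding box (y0, y1, x0, x1) of a mask."""
    ys, xs = np.where(mask)
    return ys.min(), ys.max(), xs.min(), xs.max()

# Two vehicles in adjacent lanes, both rotated 45 degrees.
m1 = ellipse_mask(100, 100, 35, 50, 25, 8, 45)
m2 = ellipse_mask(100, 100, 65, 50, 25, 8, 45)
print((m1 & m2).sum())       # 0: the ellipses do not overlap

b1, b2 = aabb(m1), aabb(m2)
box_overlap = (min(b1[1], b2[1]) >= max(b1[0], b2[0])
               and min(b1[3], b2[3]) >= max(b1[2], b2[2]))
print(box_overlap)           # True: the bounding boxes do intersect
```

The spurious box intersection would distort IOU-based matching and suppression, whereas the ellipse representation reports zero overlap.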
Fig. 3 The superiority of bounding ellipses over bounding boxes for representing vehicles

A focal loss was applied for both center-point detection and classification, as first devised in RetinaNet [32]. Eq. (3) gives the definition of the focal loss used in the original CenterNet.
In Eq. (3), Ŷxyc and Yxyc are the predicted and ground-truth cell values of a heatmap, respectively, and N is the number of ground-truth bounding ellipses to be detected in a training batch of images. The focal loss is a variant of the cross-entropy loss wherein the presence and absence of vehicles are weighted differently in order to counteract the extreme imbalance between the two. For ground-truth center points, (1 − Ŷxyc)^α assigns more weight to the loss of vehicle presence when the predicted intensity of vehicle presence is low and less weight when the predicted intensity is high. For other points, (Ŷxyc)^α exaggerates the loss of vehicle absence when the predicted intensity of vehicle absence is low and diminishes it when the predicted intensity is high. (1 − Yxyc)^β is a weight that decreases the loss of vehicle absence for the non-zero cells of the ground-truth heatmaps, which belong to the region within a bounding ellipse. The hyper-parameters α and β were set to 2 and 4, respectively, as in the original setup [33].
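Under these definitions, the focal loss of Eq. (3) can be sketched as follows; the array shapes and example values are illustrative:

```python
import numpy as np

def centernet_focal_loss(y_hat, y, alpha=2, beta=4, eps=1e-12):
    """Focal loss of Eq. (3).  N is the number of centre points (cells
    where the ground truth equals 1); all other cells are treated as
    negatives, down-weighted by (1 - Y)^beta inside the ellipse."""
    pos = (y == 1.0)
    n = max(int(pos.sum()), 1)
    pos_loss = ((1 - y_hat) ** alpha * np.log(y_hat + eps))[pos].sum()
    neg_loss = ((1 - y) ** beta * y_hat ** alpha
                * np.log(1 - y_hat + eps))[~pos].sum()
    return -(pos_loss + neg_loss) / n

y = np.zeros((4, 4))
y[1, 1] = 1.0          # centre point
y[1, 2] = 0.5          # inside the ellipse: a down-weighted negative
good = np.clip(y, 1e-3, 1 - 1e-3)   # near-perfect prediction
bad = np.full((4, 4), 0.5)          # uninformative prediction
print(centernet_focal_loss(good, y) < centernet_focal_loss(bad, y))  # True
```

The comparison at the end confirms the intended behavior: confident, correct heatmaps incur a much smaller loss than uniform predictions.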
The other head of a CenterNet is an embedding for regression. The first regression loss adjusts the positioning error of the center points caused by the reduced size of the heatmap. This regression is inevitable because the center point is not predicted in the original image of size (W × H) but, rather, in a heatmap with the down-sampled size of (W/R × H/R), where R is the down-sampling stride. The predicted offset is used to adjust a predicted center point in the heatmap, and the adjusted coordinates are then magnified to reproduce the corresponding center position in the original input image. For this offset regression, only the cells of the heatmap that correspond to center points are considered. Eq. (4) gives the offset loss to be minimized while training a model. The embedding dimension of the offset regression is (W/R × H/R × 2) to accommodate differences in both the horizontal and vertical coordinates.
In Eq. (4), ô_k is the predicted offset for the k-th center point, and o_k is the target offset, which corresponds to the fractional coordinates lost by the down-sampling (p/R − ⌊p/R⌋ for a center point p in the original image). The second regression is intended to match the size and orientation of bounding ellipses. This scheme is another distinction of the present study from the original CenterNet. The original CenterNet matches the predicted width and height of bounding boxes to the ground truth. This choice is inefficient, however, as illustrated in Fig. 3, because the width and height of a bounding box are not directly associated with the size of a target object. To find these dimensions, a deep net must first recognize the edges of a bounding box that face the end portions of an object and then infer the location and size of the box. The present study adopted a loss function that directly minimizes the difference in axis lengths between predicted and ground-truth bounding ellipses. Mathematically, finding the two axes of a bounding ellipse amounts to deriving the eigenvectors of the covariance matrix of the data points within the bounding ellipse. This computation is compatible with conducting a conventional principal component analysis (PCA), which can be easily accomplished using a neural network with a single hidden layer.
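The offset target and the smooth L1 penalty of Eq. (4) can be made concrete as follows; the stride R = 4 matches the typical CenterNet down-sampling factor and is an assumption here:

```python
import numpy as np

def offset_target(p, R=4):
    """Target offset for a centre point p = (x, y) in the original image:
    the fractional part lost when mapping p to the low-resolution
    heatmap cell floor(p / R), with down-sampling stride R."""
    p = np.asarray(p, dtype=float)
    return p / R - np.floor(p / R)

def smooth_l1(d, delta=1.0):
    """Smooth L1 (Huber) penalty applied element-wise to a residual d:
    quadratic near zero, linear for large residuals."""
    d = np.abs(d)
    return np.where(d < delta, 0.5 * d ** 2, d - 0.5 * delta)

# A centre at pixel (101, 46) lands in heatmap cell (25, 11); the offset
# head must recover the lost fractions 0.25 and 0.5.
print(offset_target((101, 46)))
```

At inference, the predicted offset is added to the peak's heatmap coordinates before scaling back up by R, which removes the quantization error introduced by the coarse grid.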
Two channels of embedding are necessary to accommodate the lengths of the two axes of a bounding ellipse. The orientation regression adds an additional channel, and, thus, the dimension of the regression embedding becomes (W/R × H/R × 3). If the previous offset channels are integrated, a total of five channels is necessary for the regression embedding. It should be noted that each regression loss is defined only at the positions of center points. Eq. (5) gives the size and orientation loss.
In Eq. (5), ŝ_k and s_k are the predicted and ground-truth tensors of size (3 × 1), each of which represents the two axis lengths and the orientation of a bounding ellipse.
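The equivalence between the ellipse axes and the principal components claimed above can be checked numerically: sampling points uniformly inside a rotated ellipse and eigen-decomposing their covariance recovers the half-axis lengths and the orientation. For a uniform ellipse, the variance along a half-axis of length L is L²/4, so L = 2·sqrt(variance); the sampling setup is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ellipse with half-axes (a, b) = (20, 8), first axis rotated by 30 deg.
a, b, theta = 20.0, 8.0, np.radians(30.0)
n = 200_000
r = np.sqrt(rng.random(n))                    # uniform over the unit disc
phi = rng.random(n) * 2 * np.pi
pts = np.stack([a * r * np.cos(phi), b * r * np.sin(phi)], axis=1)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
pts = pts @ rot.T                             # rotate into the image frame

evals, evecs = np.linalg.eigh(np.cov(pts.T))  # eigenvalues in ascending order
est_b, est_a = 2 * np.sqrt(evals)             # recovered half-axes
est_theta = np.degrees(np.arctan2(evecs[1, 1], evecs[0, 1])) % 180
print(round(est_a, 1), round(est_b, 1), round(est_theta, 1))  # near 20, 8, 30
```

This is the PCA computation the regression head is expected to approximate when predicting the two axis lengths and the orientation angle.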
For SpotNet, an extra head is added to accommodate semantic segmentation. Whereas the other heads are connected directly to the last feature map of the backbone network, the segmentation head up-samples the last feature map of the backbone to generate an output feature map with the same size as the input image (see Fig. 1). The binary cross-entropy loss is minimized for the semantic segmentation. Eq. (6) gives the segmentation loss.
In Eq. (6), m̂_j and m_j represent the predicted and ground-truth cell values, respectively, for the j-th position in an output feature map of size (W × H).
Finally, the total loss, L_total, is set up in Eq. (7) by integrating all the losses introduced above. The last term in the total loss is included only when SpotNet is used.
In Eq. (7), λ_off and λ_size are the relative weights for the offset and size losses, set at 1.0 and 0.1, respectively, following the training scheme of the original CenterNet.
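Eq. (7) can be sketched as a simple weighted sum; the segmentation weight lam_seg is an assumption, as the text only states that the segmentation term is present for SpotNet:

```python
def total_loss(l_focal, l_off, l_size, l_seg=None,
               lam_off=1.0, lam_size=0.1, lam_seg=1.0):
    """Combined training objective of Eq. (7): the focal (heatmap) loss
    plus weighted offset and size/orientation regression losses.  The
    segmentation term is included only when SpotNet is used."""
    loss = l_focal + lam_off * l_off + lam_size * l_size
    if l_seg is not None:
        loss += lam_seg * l_seg
    return loss

print(total_loss(1.2, 0.3, 0.8))              # CenterNet: 1.2 + 0.3 + 0.08
print(total_loss(1.2, 0.3, 0.8, l_seg=0.5))   # SpotNet adds the seg term
```

In practice each component would be a differentiable tensor produced by its head; the scalar version above only shows how the weights combine.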

Testbed and data preparation
Deep-learning models for vehicle detection have an advantage in that the number of vehicle types to be classified is relatively small. Nonetheless, a vehicle detector that can work everywhere does not exist at the current stage of traffic surveillance. Most previous studies have attempted to train their models on a site-by-site basis [36]-[38] or, at least, to fine-tune each model with local data after pretraining on an open dataset [33].
For the new labeling task with bounding ellipses, we devised a simple annotation tool that makes the labeling task as easy as drawing bounding boxes.
The present study begins with the difficulty of securing a universal vehicle detector for traffic surveillance at real sites located in Bucheon, South Korea. At the initial stage of the project, we expected existing state-of-the-art object detectors to work well because they had been trained on open datasets such as COCO [45] and PASCAL VOC [46]. We realized, however, that such detection models cannot be fully qualified without being fine-tuned using local images on a site-by-site basis. This is the status quo of deep-learning technologies for traffic surveillance. In the same context, labeling with bounding ellipses was conducted for a specific site.
The testbed was a signalized intersection with four legs located in Bucheon city. One thing that differentiates the present experiment from others is the use of a single fish-eye camera that covered all the intersection approaches in a single video frame. This scheme is more economical than other surveillance schemes that require a camera for every intersection approach. For data augmentation, we chose three schemes: label smoothing [48], Mosaic [30], and CutMix [49], each of which has proved effective in enhancing detection accuracy. The three augmentation schemes were tested both independently and in combination for bounding boxes and ellipses. Some mathematical tricks were necessary in order to apply CutMix to the bounding ellipses. Fig. 5 shows typical examples of the images augmented for training the models in the present study. We applied the Mosaic and CutMix schemes to the training data off-line, and for each scheme we generated as many augmented images as original images. The original and augmented images with box and ellipse labels are available at http://00bigdata.cau.ac.kr/.
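For reference, the image-level operation of CutMix can be sketched as below. The additional geometric handling needed to adjust bounding-ellipse labels for the pasted region (the "mathematical tricks" mentioned above) is not specified in the text and is not reproduced here; all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def cutmix_images(img_a, img_b, lam=None):
    """Standard CutMix on the image tensors only: paste a random patch
    of img_b into img_a, with the patch area governed by lam so that a
    fraction (1 - lam) of img_a is replaced."""
    h, w = img_a.shape[:2]
    lam = rng.beta(1.0, 1.0) if lam is None else lam
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x0, x1 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    mixed = img_a.copy()
    mixed[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]
    return mixed

a = np.zeros((64, 64, 3), dtype=np.uint8)   # stand-in for one frame
b = np.full((64, 64, 3), 255, dtype=np.uint8)
m = cutmix_images(a, b, lam=0.75)
print((m == 255).any(), (m == 0).any())     # patch pasted, rest untouched
```

For box labels, CutMix clips annotations to the pasted region; doing the same for ellipses requires intersecting each ellipse with the patch rectangle, which is where the extra geometric handling arises.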

Testing the detection models
The proposed detection models were given a well-known backbone network.A double-stacked hourglass network was selected for both the CenterNet and SpotNet according to the results of previous work [23], [34].
For the reference models, a CSPDarknet53 was used for YOLO4, and a VGG-16 network was used to set up the SSD, because these backbones recorded the best performance in previous works [30], [31]. Basically, all models were trained and tested on data collected in the testbed. In addition, the UA-DETRAC dataset [47], which collects images specifically for vehicle detection, was utilized to pretrain the models. Because only the backbone was extracted from the models pretrained with bounding boxes for the next stage of fine tuning, the CenterNet and SpotNet models with bounding ellipses were also able to use the pretrained backbones. Our experimental results showed that freeing all weights while fine-tuning the models outperforms fixing the pretrained weights of the backbone.
For a fair comparison, all models shared the same training algorithm and hyper-parameters. Stochastic gradient descent was carried out with the Adam optimizer, which adapts the learning rate during training, using a consistent batch size of 8, and all the models shared the same stopping criterion. Each model was trained in the same computing environment with a single GPU, an NVIDIA Tesla V100 with 32 GB of HBM2 memory.
We tested four different models on the same dataset reserved for testing. Both bounding ellipses and boxes were tested for the non-anchor-based detectors (CenterNet and SpotNet), while the anchor-based models (YOLO4 and SSD) could only be tested with bounding boxes. To improve the detection performance, we attempted three different augmentation schemes and also applied three different combinations of them to the training of each model. For inference with both the CenterNet and SpotNet models, the confidence threshold to identify vehicle presence was set at 0.3, as adopted in the original model.
Table 1 lists the mAP scores from the test results when the models were trained only on local data collected in the testbed. A CenterNet with bounding ellipses outperformed all other detectors when no augmentation scheme was applied to the data. YOLO4 showed the best performance for data augmentation schemes such as label smoothing, CutMix, label smoothing together with CutMix, and a combination of Mosaic and CutMix. This result is supported by the fact that the data augmentation schemes introduced here were originally devised to enhance the performance of YOLO4. Nonetheless, the top mAP score was recorded when the Mosaic technique was applied to a CenterNet that depended on bounding ellipses. It is meaningful that a detection approach based on bounding ellipses outperformed YOLO4, a state-of-the-art object detector. When bounding boxes were used for detection, the performance of the non-anchor-based models (CenterNet and SpotNet) was inferior to that of YOLO4 with its anchor boxes. However, the models with bounding boxes outperformed the SSD, another mainstream detection model that uses anchor boxes.
Table 2 shows the test results when each model was fine-tuned on local data after being pretrained on the UA-DETRAC dataset. The augmentation schemes were applied only to the local data for fine tuning rather than to the UA-DETRAC data. For the baseline cases, where the pretrained models were fine-tuned without data augmentation, the detection performance of all models was better than that without pretraining. Among the baseline cases of pretraining, SpotNet recorded the best performance by utilizing an additional head for semantic segmentation. Although the original SpotNet was known to outperform CenterNet [34], a version of CenterNet yielded a higher mAP score than SpotNet when using bounding boxes. However, SpotNet showed a much better score when bounding ellipses were used. This confirms that adding information from semantic segmentation can be more helpful to detection with bounding ellipses than with bounding boxes. As mentioned earlier, we constructed a ground-truth silhouette based on the union of bounding ellipses, unlike the original SpotNet, which used a background subtraction method. Another reason SpotNet showed the best performance was that matching segmentation maps at the original scale bridged the size gap between the original input and the reduced heatmap.
Regarding the augmentation schemes applied to the pretrained models, the combination of Mosaic and label smoothing led to the greatest improvement for CenterNet detection with bounding ellipses. A SpotNet with the CutMix technique recorded the second-best performance when bounding ellipses were used.
No combination of data augmentation schemes could produce a YOLO4 that was superior to the two non-anchor-based detection approaches. This verifies that non-anchor-based detection approaches using bounding ellipses outperformed state-of-the-art anchor-based models for vehicle detection. It should be noted that we do not claim that the proposed detection method is the best in every case. Its superiority comes from the fact that vehicles can be delineated using ellipses without margins, and it cannot be generalized to detecting various objects of more complex shape. Even though the pretrained CenterNet had a good test performance for vehicle detection, a large dataset of 17,968 annotated images was used for fine tuning, validation, and testing. This required a great deal of human effort, and manually drawing bounding ellipses for every traffic surveillance site would not be sustainable. To reduce this effort, we conducted a sensitivity analysis to identify how much local data is necessary to secure an acceptable level of accuracy. The relationship between the number of images used for fine tuning and the test accuracy is shown in Table 3. Fine-tuning the pretrained CenterNet and SpotNet models on about 8,600 annotated images achieved a performance that almost matched that of the models fine-tuned on the full set of training images. When only 30% of the images were used for fine tuning, an mAP score higher than 0.75 was obtained using SpotNet. We also conducted an experiment to quantify the inference speed of the proposed models.
The inference time was measured as the number of video frames that could be processed within a second. The average number of frames per second (FPS) is shown in Table 4. The experiment was conducted in the same computing environment described above, with a single NVIDIA Tesla V100 GPU.
Undoubtedly, YOLO4 recorded the best detection speed. CenterNet and SpotNet had a disadvantage in speed caused by a heavier backbone and a heatmap of relatively large dimensions. CenterNet and SpotNet processed 20 and 18 FPS, respectively. Nonetheless, such speeds are sufficient for tracking vehicles in traffic surveillance, because traffic surveillance imposes no hard real-time constraint.
As noted above, the Hourglass-104 backbone outperformed the other candidates based on mAP scores. The head of CenterNet detection is composed of a heatmap and an embedding for regression, each of which minimizes a different loss function. The specification of each loss function is addressed in the third section after the description of the model architecture depicted in Fig. 1. The model architecture becomes a SpotNet if the shaded region of Fig. 1 is included. A semantic segmentation head was added to the two existing heads of the CenterNet model in the expectation that the segmentation information would help detect and classify objects. Unlike the original SpotNet, which acquired a background image from consecutive video frames, the present study used bounding ellipses directly to delineate the region of vehicle presence.

(c_x, c_y) = (⌊p_x/R⌋, ⌊p_y/R⌋)    (2)

where (p_x, p_y) is the center point in the original image and R is the down-sampling stride. (σ_x, σ_y) denotes the bandwidths of the horizontal and vertical axes, respectively, which are set in proportion to the labeled axis lengths of the bounding ellipse. In Eq. (4), |·| denotes the smooth L1 operator. At inference time, a predicted center point in a heatmap is adjusted using the predicted offset ô_k.

Fig. 4
Figs. 4(c) and (d) show example photos taken by the fish-eye camera in the testbed. Four images that cover the intersection approaches were cropped from each frame and later fed to the detectors.

Fig. 6
Fig. 6 shows examples of success and failure in vehicle detection using the CenterNet and SpotNet models. In successful cases, the two non-anchor-based detectors identify vehicles with bounding ellipses that have no margins, and the orientation of the vehicles is accurately determined. Most failures occurred when vehicles were doubly detected. The results are promising in that false positives and missed vehicles were rare. Both models also capably detected vehicles at nighttime.

Table 1 .
Test scores (mAP) for different detection models that were trained only on the testbed data

Table 2 .
Test scores (mAP) for different detection models that were pre-trained on the UA-DETRAC dataset

Table 3 .
Test performance (mAPs) according to the number of annotated images for fine tuning

Table 4 .
Inference speed of detection models