Abstract:
Video Object Detection (VOD) is one of the fundamental problems in video understanding, with applications ranging from surveillance to autonomous driving. However, many such real-world applications cannot leverage existing VOD models because their high computational complexity reduces inference speed. Single-stage still-image object detection models are used naively instead, without exploiting any video information. In this paper, we present a YOLOX-based VOD model, YOLO-MaxVOD, which provides a better trade-off between accuracy and inference time than current real-time VOD solutions. Specifically, we propose a temporal fusion module that integrates within the YOLOX architecture to take advantage of the high speed that YOLOX offers. In our experiments on the ImageNet VID dataset, YOLO-MaxVOD achieves a 4.4-5.6% AP50 improvement over the baseline YOLOX, across different versions, with just a 1-2 ms increase in latency on an NVIDIA 1080Ti GPU.
Date of Conference: 08-11 October 2023
Date Added to IEEE Xplore: 11 September 2023