Impact Statement:
Transformer structures excel at delivering task-relevant clues by performing a triplet mapping (query, key, and value) and nonlinear weighted aggregation. However, the computational burden of a transformer-based tracker grows with the number of tokens involved, which impedes multitemplate temporal modeling. To this end, inspired by recent prompt engineering studies, we propose to transmit the historical target appearance via efficient prompt learning, balancing the tracking needs of spatiotemporal modeling and speed. A modified window-attention structure is also designed to cooperate with the prompt design, constraining the self-attention computation to a local window. The experimental analysis demonstrates the merit of the proposed approach to spatiotemporal appearance modeling via the novel prompt learning design.
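To illustrate the window-attention idea only, the following is a minimal PyTorch sketch of self-attention restricted to non-overlapping local windows; the module name, dimensions, and window size are illustrative assumptions, not the paper's actual block.

```python
# Sketch of window-constrained self-attention (illustrative; not the authors' exact design).
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Self-attention computed independently inside non-overlapping local windows."""
    def __init__(self, dim: int, window: int, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); tokens assumed divisible by the window size.
        b, n, d = x.shape
        w = self.window
        x = x.view(b * n // w, w, d)   # group tokens into local windows
        out, _ = self.attn(x, x, x)    # attention cost is O(w^2) per window, not O(n^2)
        return out.view(b, n, d)

# Usage: 256 tokens attended in windows of 16, instead of one 256x256 attention map.
tokens = torch.randn(2, 256, 64)
print(WindowSelfAttention(64, 16)(tokens).shape)  # torch.Size([2, 256, 64])
```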
Abstract:
Recent transformer techniques have achieved promising performance boosts in visual object tracking, with their capability to exploit long-range dependencies among relevant tokens. However, long-range interaction comes at the expense of heavy computation that grows quadratically with the number of tokens. This becomes particularly acute in online visual tracking with a memory bank containing multiple templates, a widely used strategy for handling spatiotemporal template variations. We address this complexity problem by proposing a memory prompt tracker (MPTrack) that enables multitemplate aggregation and efficient interactions among relevant queries and clues. The memory prompt gathers supporting context from the historical templates in the form of learnable token queries, producing a concise dynamic target representation. The extracted prompt tokens are then fed into a transformer encoder–decoder to inject the relevant clues into the instance, thus ...
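As a rough illustration of how a small set of learnable prompt queries could distill a multi-template memory bank into a compact target representation via cross-attention, the sketch below uses hypothetical shapes and module names; it is an assumption-laden approximation, not the authors' implementation.

```python
# Sketch of learnable prompt queries pooling a memory bank (illustrative assumptions throughout).
import torch
import torch.nn as nn

class MemoryPrompt(nn.Module):
    def __init__(self, dim: int, num_prompts: int = 8, heads: int = 4):
        super().__init__()
        # Learnable token queries that summarize historical target appearance.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, memory_tokens: torch.Tensor) -> torch.Tensor:
        # memory_tokens: (batch, num_templates * tokens_per_template, dim)
        b = memory_tokens.size(0)
        q = self.prompts.unsqueeze(0).expand(b, -1, -1)
        prompt_tokens, _ = self.cross_attn(q, memory_tokens, memory_tokens)
        return prompt_tokens  # (batch, num_prompts, dim): a concise dynamic representation

# Usage: 5 historical templates of 64 tokens each are distilled into 8 prompt tokens,
# which could then be concatenated with search-region tokens in an encoder-decoder.
memory = torch.randn(2, 5 * 64, 256)
print(MemoryPrompt(256)(memory).shape)  # torch.Size([2, 8, 256])
```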
Published in: IEEE Transactions on Artificial Intelligence (Volume: 5, Issue: 8, August 2024)