In this paper, we describe a high-performance implementation of an optical-flow algorithm that takes advantage of the processor's architecture. Tuning the code, i.e., adapting it to take full advantage of the processor, is challenging and time consuming, and it requires efficient programming at different levels, but it can lead to significant improvements in performance. The optimized implementation presented here is of interest for a number of applications, since it delivers real-time motion estimation at high image resolution on a PC or in an embedded system based on a general-purpose processor. On a 2.83 GHz Core 2 Quad PC, it achieves a speedup of 14 over our first code version and 2052.7 f/s for the well-known 252 × 316 Yosemite sequence, and a speedup of 17.6 and 68.5 f/s for a 1016 × 1280 sequence. The description of how this performance is achieved goes beyond a specific application, since the paper illustrates how inherently dense, low-level visual algorithms (pixel-wise computation) can be structured and improved to take full advantage of a standard processor. The implementation is compared with other hardware (based on FPGAs and GPUs) and software (based on clusters, PCs, and special-purpose processors) optical-flow implementations, showing that it outperforms them.