The floating-point 4D vector dot product (DP4) unit is a key factor in the overall performance of an embedded graphics engine. In this paper, an enhanced multi-functional DP4 unit with an optimized single instruction multiple data (SIMD) architecture is proposed, in which the basic vector multiplication, addition and comparison operations of 3D graphics applications are combined with the dedicated dot product. To reduce area overhead, several fundamental hardware modules in the proposed DP4 unit are multiplexed and vectorized for the SIMD functions instead of adding discrete floating-point multipliers and adders. The proposed methods also avoid a significant increase in critical path delay, so that the high performance of entire shader applications is maintained. Normalization and rounding logic for the SIMD vector addition function is also implemented; it runs in parallel with the fundamental blocks and does not affect the critical path. Synthesis results show that the proposed multi-functional DP4 unit provides high performance with no increase in critical path delay, at a cost of only a 17% increase in area. The design can also be fully pipelined, with a latency of 4 cycles and single-cycle throughput at a clock speed of 193 MHz.
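As an illustration of the functions the unit combines, the following is a minimal behavioral sketch in Python, not the proposed hardware design. All names here (`Op`, `dp4_unit`, `f32`, `pipeline_cycles`) are hypothetical. The sketch assumes that the comparison function is a per-lane "less than" that returns 1.0 or 0.0, that the dot product is rounded once to single precision at the end, and that "single-cycle throughput" means one new operation can be issued every cycle; the abstract does not specify any of these details.

```python
import struct
from enum import Enum


def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


class Op(Enum):
    DP4 = "dp4"  # 4D dot product, scalar result
    MUL = "mul"  # per-lane vector multiplication
    ADD = "add"  # per-lane vector addition
    CMP = "cmp"  # per-lane comparison (assumed here: a < b -> 1.0 else 0.0)


def dp4_unit(op, a, b):
    """Behavioral model of one multi-functional DP4 operation on 4-element vectors."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("operands must be 4-element vectors")
    a = [f32(x) for x in a]
    b = [f32(x) for x in b]

    if op is Op.DP4:
        # Single-precision products are exact in double precision, so only
        # the final sum is rounded to single precision (an assumption).
        return f32(sum(x * y for x, y in zip(a, b)))
    if op is Op.MUL:
        return [f32(x * y) for x, y in zip(a, b)]
    if op is Op.ADD:
        return [f32(x + y) for x, y in zip(a, b)]
    if op is Op.CMP:
        return [1.0 if x < y else 0.0 for x, y in zip(a, b)]
    raise ValueError(f"unsupported operation: {op}")


def pipeline_cycles(n_ops, latency=4):
    """Cycles to finish n_ops back-to-back operations in a fully pipelined
    unit that accepts one new operation per cycle."""
    if n_ops < 1:
        return 0
    return latency + (n_ops - 1)
```

Under these assumptions, `dp4_unit(Op.DP4, [1, 2, 3, 4], [5, 6, 7, 8])` returns `70.0`, and a stream of 100 operations completes in `pipeline_cycles(100)`, that is 103 cycles. At 193 MHz, the 4-cycle latency corresponds to roughly 20.7 ns.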