I. Introduction
Modern handheld graphics processing units (GPUs) require a variety of arithmetic operations to produce realistic graphics effects [1]. A multifunction unit for this purpose was proposed in [2]; however, it was a fixed-point unit and did not support matrix-vector multiplication, which is required for geometry transformation, one of the most frequently used operations in 3D graphics. In this paper, a 4-way 32-bit floating-point (FLP) unified matrix, vector, and elementary function unit is proposed. It operates on FLP data to meet the specification of the standard API, which requires more than 24-bit FLP precision [1]. It unifies matrix-vector multiplication (MAT), vector multiplication, division, square root, multiply-add, lerp, dot product, cross product (CRS), and elementary functions, including trigonometric functions (TRGs), power (POW), and logarithm (LOG) with two variables, in a single 4-way arithmetic unit. MAT, CRS, lerp, and LOG are newly added to the operation set of [2] with little overhead. Although the unit operates on FLP data, it uses logarithmic arithmetic internally. This enables power- and area-efficient unification and achieves single-cycle throughput with a maximum latency of 5 cycles for all operations except MAT, for which a 2-cycle throughput and a 6-cycle latency are achieved.
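The benefit of internal logarithmic arithmetic is that several costly operations collapse into cheap ones: multiplication and division become addition and subtraction of logarithms, square root becomes a halving (a shift in fixed point), and power becomes a single multiplication. The following Python sketch illustrates only this principle. The helper names (`to_log`, `from_log`, `log_mul`, and so on) are hypothetical, and exact `math.log2` and `2.0 ** x` stand in for the approximate FLP-to-log and log-to-FLP converters that a hardware unit would use; it is not the proposed unit's datapath.

```python
import math

# Hypothetical illustration of log-domain arithmetic. A hardware unit would use
# approximate log/antilog converters instead of exact math.log2 and 2.0 ** x.

LogNum = tuple[int, float]  # (sign bit, log2 of the magnitude)


def to_log(x: float) -> LogNum:
    """Convert a nonzero float to (sign, log2|x|)."""
    if x == 0.0:
        raise ValueError("zero has no finite logarithm")
    return (1 if x < 0 else 0, math.log2(abs(x)))


def from_log(a: LogNum) -> float:
    """Convert (sign, log2|x|) back to a float."""
    mag = 2.0 ** a[1]
    return -mag if a[0] else mag


def log_mul(a: LogNum, b: LogNum) -> LogNum:
    # Multiplication: add the logarithms; signs combine by XOR.
    return (a[0] ^ b[0], a[1] + b[1])


def log_div(a: LogNum, b: LogNum) -> LogNum:
    # Division: subtract the logarithms; signs combine by XOR.
    return (a[0] ^ b[0], a[1] - b[1])


def log_sqrt(a: LogNum) -> LogNum:
    # Square root: halve the logarithm (a 1-bit shift in fixed point).
    if a[0]:
        raise ValueError("square root of a negative operand")
    return (0, a[1] / 2)


def log_pow(a: LogNum, y: float) -> LogNum:
    # Power x**y: multiply the logarithm by y (positive base assumed).
    if a[0]:
        raise ValueError("negative base not supported in this sketch")
    return (0, y * a[1])


if __name__ == "__main__":
    x, y = to_log(3.0), to_log(4.0)
    print(from_log(log_mul(x, y)))  # approximately 12.0
    print(from_log(log_div(x, y)))  # approximately 0.75
```

In this representation a single adder serves both multiplication and division, which hints at why operations can share hardware once values are in the log domain. The accuracy cost of approximate conversion, which a real design must bound to meet the API's precision requirement, is not modeled here.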