Binary64 arithmetic is rapidly becoming inadequate for today's large-scale computations because errors accumulate over long sequences of operations. Binary128 arithmetic is therefore needed to improve the accuracy and reliability of these computations. At the same time, modern processors increasingly extend their instruction sets with single-instruction, multiple-data (SIMD) execution, which can significantly accelerate data-parallel applications. To address both demands, this paper presents the architecture of a low-cost binary128 floating-point fused multiply-add (FMA) unit with SIMD support. The proposed FMA unit can execute one binary128 FMA every other cycle with a latency of four cycles, or two binary64 FMAs fully pipelined with a latency of three cycles, or four binary32 FMAs fully pipelined with a latency of three cycles. Binary128 FMA is supported by combining two binary64 FMA units, which requires much less hardware than a fully pipelined binary128 FMA unit. The design uses two hardware vectorization methods, segmentation and iteration, to trade off performance (throughput and latency) against area and power. Compared with a standard binary128 FMA implementation, the proposed design occupies 30 percent less area and dissipates 29 percent less dynamic power.
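The throughput and latency figures quoted above can be compared with a simple first-order pipeline model. The sketch below is illustrative only and is not taken from the paper: the names `FmaMode`, `cycles_for`, and `peak_throughput` are hypothetical, and the model assumes a stream of independent FMAs with no stalls or data dependencies. Only the lane counts, issue intervals, and latencies come from the text.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class FmaMode:
    """One operating mode of the unit (hypothetical helper type)."""
    name: str
    lanes: int     # FMAs started together in one issue
    interval: int  # cycles between successive issues
    latency: int   # cycles from issue to result


# Figures as stated in the abstract.
BINARY128 = FmaMode("binary128", lanes=1, interval=2, latency=4)
BINARY64 = FmaMode("binary64", lanes=2, interval=1, latency=3)
BINARY32 = FmaMode("binary32", lanes=4, interval=1, latency=3)


def cycles_for(mode: FmaMode, n_ops: int) -> int:
    """Cycles to finish n_ops independent FMAs (assumed: no stalls)."""
    if n_ops <= 0:
        return 0
    issues = ceil(n_ops / mode.lanes)
    # The last issue starts after (issues - 1) intervals, then needs
    # one full latency to produce its result.
    return (issues - 1) * mode.interval + mode.latency


def peak_throughput(mode: FmaMode) -> float:
    """Steady-state FMAs completed per cycle."""
    return mode.lanes / mode.interval


if __name__ == "__main__":
    for mode in (BINARY128, BINARY64, BINARY32):
        print(mode.name, cycles_for(mode, 8), peak_throughput(mode))
```

Under these assumptions, eight independent FMAs take 18 cycles in binary128 mode, 6 cycles in binary64 mode, and 4 cycles in binary32 mode, with steady-state rates of 0.5, 2, and 4 FMAs per cycle respectively. Chains of dependent operations would instead be bound by latency, which this model does not capture.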