In this paper, we present a throughput-scalable parallel and pipeline architecture for high-throughput computation of the multilevel 3-D discrete wavelet transform (3-D DWT). The computation of the 3-D DWT for each level of decomposition is split into three distinct stages, and all three stages are implemented in parallel by a processing unit consisting of an array of processing modules. The processing unit for the first-level decomposition of a video stream of frame-size (M × N) consists of Q/2 processing modules, where Q is the number of input samples available to the structure in each clock cycle. The processing unit for a higher level of decomposition requires one-eighth of the number of processing modules required by the processing unit for its preceding level. For a J-level 3-D DWT of a video stream, each of the proposed structures involves J processing units in a cascaded pipeline. The proposed structures have a small output latency and can perform multilevel 3-D DWT computation with 100% hardware utilization efficiency. The throughput rate of the proposed structures is Q/7 times higher than that of the best of the corresponding existing structures. Interestingly, the proposed structures involve a frame-buffer of size O(MN), while the frame-buffer size of the existing structures is O(MNR). Besides, the on-chip storage and the frame-buffer size of the proposed structures are independent of the input-block size, which favors the derivation of highly concurrent parallel architectures for high-throughput implementation. The overall area-delay product of the proposed structures is significantly lower than that of the existing structures: although they involve a slightly higher multiplier-delay product and a higher adder-delay product, they involve a significantly smaller frame-buffer and storage-word-delay product.
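The module counts stated above can be tabulated with a short sketch. It only restates the two ratios given in the text (Q/2 modules at the first level, one-eighth of the preceding level thereafter); the function names and the example values Q = 128 and J = 3 are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def modules_per_level(q, levels):
    """Return the processing-module count for each decomposition level.

    Level 1 uses q/2 modules; every higher level uses 1/8 of the
    count of the level before it. Fractions keep the ratios exact.
    """
    counts = [Fraction(q, 2)]
    for _ in range(levels - 1):
        counts.append(counts[-1] / 8)
    return counts


def total_modules(q, levels):
    """Sum of the module counts over all cascaded processing units."""
    return sum(modules_per_level(q, levels))


if __name__ == "__main__":
    # Hypothetical configuration: 128 samples per cycle, 3 levels.
    for level, count in enumerate(modules_per_level(128, 3), start=1):
        print(f"level {level}: {count} processing modules")
    print("total:", total_modules(128, 3))
```

A count below 1 in the result only signals that the stated ratio no longer yields a whole module for that combination of Q and J; how the design treats that case is not covered by this sketch.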
The throughput rate of the proposed structure can easily be scaled, without increasing the on-chip storage or the frame-memory, by using a larger number of processing modules, and the structure offers a greater advantage over the existing designs at higher frame-rates and larger input block-sizes. The full-parallel implementation of the proposed scalable structure gives its best performance. When the very high throughput generated by such a parallel structure is not required, the structure can be operated with a slower clock, so that speed is traded for power by scaling down the operating voltage, and/or the processing modules can be implemented with slower but more hardware-efficient arithmetic circuits.
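The speed-for-power trade-off described above can be illustrated with a rough sketch. It assumes the structure consumes exactly Q samples per clock cycle (consistent with the stated 100% hardware utilization) and uses the common first-order CMOS dynamic-power model P ∝ V²f, which is an assumption added here and not a result from the paper. The function names, frame size, frame rate, and scaling factors are all hypothetical.

```python
def required_clock_hz(frame_width, frame_height, frame_rate, q):
    """Clock rate needed to keep up with the video stream.

    Assumes q input samples are consumed in every clock cycle,
    so the clock rate is the sample rate divided by q.
    """
    samples_per_second = frame_width * frame_height * frame_rate
    return samples_per_second / q


def relative_dynamic_power(freq_scale, voltage_scale):
    """Dynamic power relative to a baseline, using P ~ V^2 * f.

    This is a first-order model (an assumption of this sketch);
    it ignores leakage and any change in switched capacitance.
    """
    return voltage_scale ** 2 * freq_scale


if __name__ == "__main__":
    # Hypothetical stream: 1920 x 1080 frames at 30 frames per second.
    for q in (16, 64, 256):
        clock = required_clock_hz(1920, 1080, 30, q)
        print(f"Q = {q:3d}: required clock = {clock / 1e6:.3f} MHz")
    # Halving the clock and lowering the supply voltage to 80 percent.
    print("relative power:", relative_dynamic_power(0.5, 0.8))
```

Under these assumptions, quadrupling Q cuts the required clock rate to a quarter for the same stream, which is the headroom that can then be spent on a lower supply voltage or on slower arithmetic circuits.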