A systolic-like modular architecture is presented for hardware-efficient implementation of two-dimensional (2-D) discrete wavelet transform (DWT). The overall computation is decomposed into two distinct stages; where column processing is performed in stage-1, while row processing is performed in stage-2. Using a new data-access scheme and a novel folding technique, the computation of both the stages are performed concurrently for transposition-free implementation of 2-D DWT. The proposed design can offer nearly the same throughput rate, and requires the same or less the number of adders and multipliers as the best of the existing structures. The storage space is found to occupy most of the area in the existing 2-D DWT structures but the proposed structure does not require any on-chip or off-chip storage of input samples or storage/transposition of intermediate output. The proposed one, therefore, involves considerably less hardware complexity compared with the existing structures. Apart from that, it has less duration of cycle period in comparison to the existing structures, and has a latency of cycles while all the existing structures have latency of cycles, the filter order being small compared to the input size .