Architectures for the discrete wavelet transform (DWT) typically operate on a sequential input, consuming a single data value every clock cycle. While this can be efficient for 1-D applications, 2-D applications such as image processing suffer from a bottleneck when the separable 2-D filter changes direction from one dimension to the other, which requires substantial on-chip memory to buffer intermediate results. Block-based architectures greatly reduce these on-chip memory requirements. Although block-based architectures may take more computation time, the work can be divided among several processors. This paper demonstrates how a block-processing architecture can be achieved and what advantages it offers for the 2-D DWT: less memory and parallel computation. Although the 2-D DWT in particular is discussed, these ideas apply to multi-dimensional cases as well.
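As a rough software illustration of the block-processing idea, the sketch below computes one level of a separable 2-D DWT both on a whole image and block by block, then checks that the results agree. It is not the architecture proposed in this paper. It assumes the two-tap Haar (averaging) filter, chosen only because its filter pairs never straddle a boundary between even-sized blocks; longer filters would need overlapping data at block edges. The names `haar_1d`, `haar_2d`, and `haar_2d_blocked`, the block size of 4, and the 8x8 test image are all hypothetical.

```python
def haar_1d(values):
    """One level of an unnormalized (averaging) Haar DWT on an even-length list."""
    low = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return low, high


def _column_pass(rows):
    """Apply haar_1d down every column; return (low, high) as lists of rows."""
    lows, highs = [], []
    for col in zip(*rows):  # zip(*rows) transposes rows into columns
        lo, hi = haar_1d(list(col))
        lows.append(lo)
        highs.append(hi)
    # Transpose back so the results are stored row by row again.
    return [list(r) for r in zip(*lows)], [list(r) for r in zip(*highs)]


def haar_2d(block):
    """Separable one-level 2-D Haar DWT: filter along rows, then along columns.

    Naming convention used here (an assumption): the first letter is the
    row-direction filter, the second letter is the column-direction filter.
    """
    row_low, row_high = [], []
    for row in block:
        lo, hi = haar_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = _column_pass(row_low)
    hl, hh = _column_pass(row_high)
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


def haar_2d_blocked(image, block_size):
    """Same transform, computed one block_size x block_size block at a time.

    Assumes block_size is even and divides both image dimensions. Only one
    block is held as working data at a time, and each block is independent
    of the others, so blocks could be handed to separate processors.
    """
    height, width = len(image), len(image[0])
    half = block_size // 2
    out = {name: [[0.0] * (width // 2) for _ in range(height // 2)]
           for name in ("LL", "LH", "HL", "HH")}
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            block = [row[left:left + block_size]
                     for row in image[top:top + block_size]]
            for name, coeffs in haar_2d(block).items():
                # Copy the block's coefficients into their place in the subband.
                for r, coeff_row in enumerate(coeffs):
                    out[name][top // 2 + r][left // 2:left // 2 + half] = coeff_row
    return out


# Hypothetical 8x8 test image with arbitrary integer samples.
image = [[(7 * r + 3 * c) % 17 for c in range(8)] for r in range(8)]
full = haar_2d(image)
blocked = haar_2d_blocked(image, 4)
assert full == blocked  # block-wise and whole-image results agree for Haar
```

In this sketch the blocked version touches only a 4x4 working block at a time instead of buffering full rows or columns of the image, which mirrors the memory argument above in a simplified form.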