The large number of frame memory accesses and the die area occupied by the embedded internal buffer are the most critical issues in implementing the two-dimensional discrete wavelet transform (2D DWT). The former may consume the most power and waste system memory bandwidth. The latter may enlarge the chip size and also consume considerable power. We categorize and analyze 2D DWT architectures according to their external memory scan methods. We then propose the overlapped stripe-based scan method, which provides an efficient and flexible implementation of the 2D DWT. The implementation issues of the internal buffer are also discussed for both lifting-based and convolution-based architectures. Real-life experiments show that the area and power of the internal buffer depend strongly on the memory technology and working frequency, rather than on the number of required memory bits alone.