Graphics Processing Units (GPUs), with many lightweight data-parallel cores, can provide substantial parallel computing power to accelerate general-purpose applications. Both AMD and NVIDIA provide their own high-performance GPUs and software platforms. As floating-point computing capacity continues to increase, the ``memory wall'' problem becomes more serious, especially for array-intensive applications. In this paper, we implement and optimize two SPEC2k benchmarks, mgrid and swim, on multithreaded GPUs using CUDA and Brook+. To reduce the pressure on off-chip memory, we exploit data locality in the multi-level memory hierarchy and hide long memory access latency via double buffering. To balance inter-thread parallelism and intra-thread locality, we further tune the thread granularity of each kernel and empirically study the best trade-off point. Flow control instructions can significantly reduce the effective instruction throughput. To address this problem, we introduce a divergence elimination technique that converts conditional expressions into computing operations. With all the optimizations, we achieve speedups of 10×-34× over the CPU implementation on AMD and NVIDIA GPUs, respectively. Finally, we summarize and compare the AMD and NVIDIA GPUs in terms of hardware and software.
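As a rough illustration of converting a conditional expression into a computing operation, the sketch below contrasts a branching update with a branch-free equivalent. It is written in plain C as a stand-in for a CUDA or Brook+ kernel body, and the function names, the threshold test, and the scaling operation are hypothetical choices for illustration, not code taken from the mgrid or swim kernels.

```c
#include <stddef.h>

/* Branching form: each element takes one of two paths. On a GPU, threads
 * of the same SIMD group that disagree on the condition are serialized. */
void scale_branching(float *x, size_t n, float threshold, float factor) {
    for (size_t i = 0; i < n; ++i) {
        if (x[i] > threshold) {
            x[i] = x[i] * factor;
        }
    }
}

/* Branch-free form: the comparison result (0 or 1 in C) is turned into a
 * numeric mask, so every element executes the same arithmetic sequence.
 * Assumes factor is finite; otherwise 0 * factor would not be 0. */
void scale_branchless(float *x, size_t n, float threshold, float factor) {
    for (size_t i = 0; i < n; ++i) {
        float m = (float)(x[i] > threshold);      /* 1.0f or 0.0f */
        x[i] = x[i] * (m * factor + (1.0f - m));  /* factor or 1.0f */
    }
}
```

The branch-free version always performs the multiplication, so it trades a few extra arithmetic operations for uniform control flow. Whether that trade pays off depends on how often the condition actually diverges within a group of threads.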