In this paper, we propose two methods, contiguous diagonal parallelization and fused execution of macroblock (MB) processing, to improve the parallel processing efficiency of H.264 motion estimation on GPUs. The first method doubles the degree of MB-level parallelism while maintaining coding efficiency by relaxing weak data dependencies. The second accelerates motion estimation by a factor of 1.3 by improving the parallel processing efficiency within each MB. The evaluation results show that the processing time is reduced to 13.2 ms/frame (about 75 frames/s), with a bit-rate increase of only 8.2% relative to JM16.0. The proposed methods can therefore contribute to real-time full-HD encoding with a sufficiently small bit-rate increase.