Recent researches on parallelization of H.264 video decoders focused on fine-grain methods. These works led to designs having very short latencies and good memory usage. However, they could not reach the scalability of Group of Pictures (GOP) level approaches although assuming a well-designed entropy decoder which can feed the increasing number of parallel working cores. We would like to introduce a GOP-level approach due to its high scalability, mentioning solution approaches for the well-known latency and memory issues. Our design revokes the need to a scanner for GOP startcodes which was used in the earlier methods. This approach lets all the cores work on the decoding task. Although the performance on shared memory systems is subject to improve, we have observed a one-to-one linear speedup in parallel working nodes. We have tested our method using a cluster of 5 machines each having 2 processors with 4 cores. The decoding is 5 times faster when we run only one process in each machine, that is we saw one-to-one linear speedup when there is no memory shortage. We observed a maximum of 11 times speedup when using all of the 40 cores distributed among 5 machines.