I. Introduction
Video generation has gained substantial traction as a research topic, with many methods being developed to generate long-range, high-resolution videos that are indistinguishable from natural footage. Its popularity stems from its relevance to fields such as reinforcement learning [1] and motion planning for autonomous systems [2]. The task is nonetheless very difficult to model, owing to its multi-modal nature and to the tree of possible futures, which grows exponentially after the initial frames. Video generation has numerous real-world applications, including the interpolation of full-sized videos from cropped videos and scene generation in video games. It can also be used to generate videos that serve as datasets for machine learning models. In robotics, video generation can play a vital role in handling occlusion, predicting object movement [3], and trajectory optimization.