Joint learning of images and videos with a single Vision Transformer | IEEE Conference Publication | IEEE Xplore