Video Language Model Pretraining with Spatio-temporal Masking | IEEE Conference Publication | IEEE Xplore