An Empirical Study of Training End-to-End Vision-and-Language Transformers | IEEE Conference Publication | IEEE Xplore