Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | IEEE Conference Publication | IEEE Xplore