Leveraging per Image-Token Consistency for Vision-Language Pre-training | IEEE Conference Publication | IEEE Xplore