ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval | IEEE Conference Publication | IEEE Xplore