Visual-Linguistic Representation Learning with Deep Cross-Modality Fusion for Referring Multi-Object Tracking | IEEE Conference Publication | IEEE Xplore