Fine-grained Text-Video Fusion for Referring Video Object Segmentation | IEEE Conference Publication | IEEE Xplore