VLCAP: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning
