Language-guided Multi-Modal Fusion for Video Action Recognition | IEEE Conference Publication | IEEE Xplore