MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning | IEEE Conference Publication | IEEE Xplore