A Cross-Modal Spatio-Temporal Interaction Network for Video Question Answering | IEEE Conference Publication