Multimodal Emotion Analysis Based on Multi-Scale Feature Fusion and Cross-Modal Attention: Architecture Comparison and Feature Extractor Optimization | IEEE Conference Publication | IEEE Xplore