SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model | IEEE Conference Publication | IEEE Xplore