SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding | IEEE Conference Publication | IEEE Xplore